Gabe Grand (@gabegrand.bsky.social) · PhD student @csail.mit.edu 🤖 & 🧠
Paper + code + interactive demos: gabegrand.github.io/battleship ⚓️🎯
October 27, 2025 at 7:17 PM
Does this generalize? YES. We replicated our results on "Guess Who?" from TextArena and saw similar gains: GPT-4o (61.7% → 90.0%), Llama-4-Scout (30.0% → 72.4%). The framework works across information-seeking domains with combinatorial hypothesis spaces.
October 27, 2025 at 7:17 PM
Deciding when to explore vs. act is also key. Skilled players (humans + GPT-5) spread their questions out over the course of the game. Weak LMs spam all 15 questions upfront. The key isn't asking MORE questions; it's asking BETTER ones at the RIGHT time. Quality > quantity.
October 27, 2025 at 7:17 PM
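A minimal sketch of what an explore-vs-act rule could look like, assuming a simple threshold policy; the cutoff values, the hit-probability heuristic, and the function name are illustrative assumptions, not the decision procedure from the paper. EIG here stands for expected information gain, the number of bits a question is expected to reveal about the hidden board.

```python
def choose_action(best_question_eig: float,
                  best_move_hit_prob: float,
                  questions_remaining: int,
                  eig_threshold: float = 0.5) -> str:
    """Toy explore-vs-act policy (illustrative only): spend one of the limited
    questions when the best candidate question is informative (high EIG) and
    the best available shot is still uncertain; otherwise just fire."""
    if (questions_remaining > 0
            and best_question_eig >= eig_threshold
            and best_move_hit_prob < 0.9):
        return "ask"   # explore: a good question will narrow the hypothesis space
    return "fire"      # act: we already know enough to take a strong shot
```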
Our approach leverages inference scaling to enable models to ask more informative questions. Bayes-Q boosts EIG by up to 0.227 bits (94.2% of the theoretical ceiling) and virtually eliminates redundant questions (18.5% → 0.2% for Llama-4-Scout).
October 27, 2025 at 7:17 PM
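For intuition, here is a minimal sketch of EIG-based question scoring over a sampled hypothesis space. It assumes hypotheses are candidate hidden boards with posterior weights and that each candidate question is represented by an answer function; these representations are assumptions for illustration, not the paper's actual Bayes-Q implementation.

```python
import math
from collections import defaultdict

def expected_information_gain(answer_fn, hypotheses, weights):
    """EIG of a question = entropy of the current posterior over hypotheses
    minus the expected posterior entropy after observing the answer.

    answer_fn: maps a hypothesis (candidate hidden board) to an answer.
    hypotheses: candidate boards consistent with the observations so far.
    weights: posterior probability of each hypothesis (sums to 1).
    """
    prior_entropy = -sum(w * math.log2(w) for w in weights if w > 0)

    # Group posterior mass by the answer each hypothesis would produce.
    by_answer = defaultdict(list)
    for h, w in zip(hypotheses, weights):
        by_answer[answer_fn(h)].append(w)

    expected_posterior_entropy = 0.0
    for ws in by_answer.values():
        mass = sum(ws)
        expected_posterior_entropy += mass * -sum(
            (w / mass) * math.log2(w / mass) for w in ws if w > 0
        )
    return prior_entropy - expected_posterior_entropy

def best_question(candidate_answer_fns, hypotheses, weights):
    """Pick the highest-EIG question among LM-proposed candidates."""
    return max(candidate_answer_fns,
               key=lambda q: expected_information_gain(q, hypotheses, weights))
```

In this framing, a redundant question is one whose answer is already determined by every surviving hypothesis, so its EIG is zero.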
In head-to-head comparisons, both GPT-4o and Llama-4-Scout now beat GPT-5 while being 2.8x and 99.7x cheaper, respectively.
October 27, 2025 at 7:17 PM
With all three Bayesian components (+Bayes-QMD), Llama-4-Scout jumps from near-random guessing (0.367 F1) to super-human level (0.764 F1). GPT-4o sees similar gains (0.450 → 0.782 F1). The deltas are really striking.
October 27, 2025 at 7:17 PM
We find that having models write Python functions to answer questions boosts accuracy by +14.7 percentage points (absolute) and complements chain-of-thought (CoT) reasoning.
October 27, 2025 at 7:17 PM
One useful trick to improve answering accuracy is to use code generation. Code grounds reasoning in executable logic, not just vibes.
October 27, 2025 at 7:17 PM
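To make the code-as-answer idea concrete, here is a hypothetical example of the kind of function an LM might write for the question "Is the red ship horizontal?"; the board encoding (ship name → list of occupied tiles) is an assumption made for this sketch, not the actual interface from the paper.

```python
# Hypothetical LM-generated answer function for the question
# "Is the red ship horizontal?". Board encoding (ship name -> list of
# (row, col) tiles) is an illustrative assumption.
def answer(board: dict) -> bool:
    red_tiles = board["red"]
    rows = {r for r, _ in red_tiles}
    return len(rows) == 1  # horizontal iff every tile shares the same row

# The answerer executes the generated function against the ground-truth board,
# so the reply is grounded in the actual ship placements rather than "vibes".
true_board = {"red": [(2, 3), (2, 4), (2, 5)], "blue": [(0, 1), (1, 1)]}
print(answer(true_board))  # True
```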
Many LMs really struggle with questions that require grounding answers in the board and dialogue context. GPT-4o drops from 72.8% → 60.4% accuracy on context-dependent questions. Llama-4-Scout: 68.0% → 54.0%. Humans? Basically flat (92.8% vs 91.9%).
October 27, 2025 at 7:17 PM
Overall, humans are really reliable at answering questions on BattleshipQA (92.5% accuracy). In contrast, LM accuracy ranges widely—from near-random (52.5%, GPT-4o-mini) to human-level (92.8%, o3-mini). But there's a catch…
October 27, 2025 at 7:17 PM
To understand how people strategize & collaborate, we ran a two-player synchronous human study (N=42) and collected full action trajectories and chat dialogues. Our “BattleshipQA” dataset provides a rich, multimodal benchmark for comparing human and agent behavior.
October 27, 2025 at 7:17 PM
Do AI agents ask good questions? We built “Collaborative Battleship” to find out—and discovered that weaker LMs + Bayesian inference can beat GPT-5 at 1% of the cost.

Paper, code & demos: gabegrand.github.io/battleship

Here's what we learned about building rational information-seeking agents... 🧵🔽
October 27, 2025 at 7:17 PM