David Heineman
@davidheineman.com
Pre-doc @ai2.bsky.social
davidheineman.com
(6/6) A huge thanks to my collaborators! @valentinhofmann.bsky.social @ianmagnusson.bsky.social Yuling Gu @nlpnoah.bsky.social @hanna-nlp.bsky.social @kylelo.bsky.social @jessedodge.bsky.social
📄: arxiv.org/abs/2508.13144
📝: allenai.org/blog/signal-noise
💻: github.com/allenai/signal-and-noise
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific pr...
arxiv.org
August 19, 2025 at 4:46 PM
(5/6) SNR naturally gives us a way to improve benchmarks! We introduce 3 “interventions” in our work. For example:
❗️ Simply using the top 16 MMLU subtasks by SNR exhibits better decision accuracy and lower scaling law error than using the full task (only 6 for an AutoBencher task)
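A rough sketch of this intervention (the function name and macro-averaging are illustrative, not the repo's API; the paper may aggregate differently, e.g. over examples rather than subtasks):

```python
def filter_subtasks_by_snr(subtask_snr, subtask_scores, k=16):
    """Keep the k subtasks with the highest SNR, then macro-average.

    subtask_snr: dict mapping subtask name -> SNR, estimated once from a
        population of models (signal) and checkpoint variation (noise).
    subtask_scores: dict mapping subtask name -> this model's score.
    """
    # Rank subtasks by their SNR and keep only the top k.
    top = sorted(subtask_snr, key=subtask_snr.get, reverse=True)[:k]
    # Report the mean score over just those high-SNR subtasks.
    return sum(subtask_scores[t] for t in top) / len(top)
```

The SNR ranking is computed once from a held-out model population, so the same k subtasks are reused when evaluating any new model.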
(4/6) 🧐 How do we know SNR is meaningful? We can (1) calculate the % of models ranked correctly at small scale vs. 1B scale, and (2) fit scaling laws to predict task performance.
SNR is predictive of better decision accuracy, and tasks with lower noise have lower scaling law error!
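Decision accuracy here can be read as pairwise ranking agreement between the small-scale and 1B-scale runs. A minimal sketch (a hypothetical helper, ignoring ties):

```python
from itertools import combinations

def decision_accuracy(small_scores, large_scores):
    """Fraction of model pairs ranked the same at both scales.

    small_scores / large_scores: dicts mapping model name -> benchmark
    score at the small and large (e.g. 1B) scale.
    """
    pairs = list(combinations(small_scores, 2))
    # A pair "agrees" if the small-scale ranking matches the large-scale one.
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```

A benchmark with high SNR at the small scale should make more of these pairwise decisions correctly.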
(3/6) 🔎 We landed on a simple metric, the signal-to-noise ratio (SNR): the ratio between the dispersion of scores across models (the signal) and the variation across final checkpoints of a single model (the noise).
This allows estimating SNR with a small population of models (around 50) at any compute scale!
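Under that definition, a minimal sketch (using standard deviations for both dispersion and checkpoint variation; the paper's exact estimators may differ):

```python
import statistics

def snr(final_scores, checkpoint_scores):
    """Signal-to-noise ratio of a benchmark.

    final_scores: final benchmark score of each model in a population
        (its spread is the signal).
    checkpoint_scores: scores of the last few checkpoints of a single
        model (their spread is the noise).
    """
    signal = statistics.pstdev(final_scores)      # dispersion across models
    noise = statistics.pstdev(checkpoint_scores)  # checkpoint-to-checkpoint wobble
    return signal / noise
```

Both quantities are cheap: the signal needs one evaluation per model, and the noise needs only the final checkpoints of a single training run.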
(2/6) Consider these training curves: 150M, 300M, and 1B parameter models trained on 25 pretraining corpora. Many benchmarks can separate models but are too noisy, and vice versa! 😧
We want ⭐ low noise and high signal ⭐: *both* low variance during training and a high spread of scores across models.