Would be exciting to run more experiments on this!
📄 arxiv.org/abs/2509.11106
✍️ allenai.org/blog/fluid-b...
💻 github.com/allenai/flui...
📊 huggingface.co/datasets/all...
Looking forward to chatting more at #COLM2025! 👋
Example: on MMLU, Fluid Benchmarking results in lower step-to-step variance and higher validity than standard methods while using 50 times fewer questions. ⚡
It also increases validity: results generalize better to other benchmarks targeting the same capability. One reason: it automatically avoids mislabeled questions, cutting label errors by 99%! 🤯
We find that Fluid Benchmarking dynamically adapts to these changes, administering easier questions early in training and more difficult ones later.
Adaptive question selection means that LLMs face different sets of questions, but ability estimation aligns results in a common space.
To select the next question, we use Fisher information. Essentially: we pick a question whose difficulty (b) is close to the current ability estimate (θ) and whose discrimination (a) is high.
Then we update the estimate.
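Under a 2PL IRT model, Fisher information has a closed form, which makes the selection rule easy to sketch. A minimal illustration (function and variable names are my own, not from the Fluid Benchmarking repo):

```python
import math

def p_correct(theta, a, b):
    # 2PL item response function: probability of a correct answer
    # for a model with ability theta on a question with
    # discrimination a and difficulty b
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    # I(theta) = a^2 * p * (1 - p); it peaks when the question's
    # difficulty b is near theta and its discrimination a is large
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next(theta, items):
    # items: (a, b) pairs for questions not yet administered
    return max(items, key=lambda ab: fisher_information(theta, ab[0], ab[1]))

# At theta = 0, the well-matched, highly discriminative item wins
items = [(0.5, 0.1), (2.0, 0.0), (2.0, 3.0)]
print(select_next(0.0, items))  # -> (2.0, 0.0)
```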
The IRT ability estimate summarizes performance much like accuracy does, but unlike accuracy it accounts for question characteristics such as difficulty and discrimination.
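One way to see how an ability estimate is computed: after each response, find the θ that best explains the observed answers under the 2PL model. This sketch uses brute-force grid-search maximum likelihood (the actual estimator in the paper may differ; all names here are hypothetical):

```python
import math

def p_correct(theta, a, b):
    # 2PL item response function
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses):
    # responses: (a, b, correct) triples for administered questions.
    # Maximize the log-likelihood of the responses over a theta grid.
    grid = [i / 100.0 for i in range(-400, 401)]  # theta in [-4, 4]
    def log_lik(theta):
        ll = 0.0
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=log_lik)

# A model that answers easier questions (lower b) but misses a hard one
responses = [(1.5, -1.0, True), (1.5, 0.0, True), (1.5, 2.0, False)]
theta_hat = estimate_ability(responses)
```

Because θ is estimated through the questions' parameters rather than raw accuracy, two models evaluated on different question sets still land on the same scale.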
IRT also measures the discrimination of a question, meaning how reliably it separates stronger from weaker LLMs.
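Discrimination can be seen directly in the 2PL response curve: with the same difficulty, a higher a makes the gap between a weak and a strong model's success probability much wider. A toy illustration with hypothetical parameter values:

```python
import math

def p_correct(theta, a, b):
    # 2PL: probability a model with ability theta answers correctly
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

weak, strong = -1.0, 1.0  # two models of different ability

# Same difficulty (b = 0), different discrimination (a)
low_a = (p_correct(weak, 0.3, 0.0), p_correct(strong, 0.3, 0.0))
high_a = (p_correct(weak, 2.5, 0.0), p_correct(strong, 2.5, 0.0))

# The high-discrimination question separates the two models far
# more sharply than the low-discrimination one
print(high_a[1] - high_a[0] > low_a[1] - low_a[0])  # -> True
```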
For LLMs, that means evaluating weaker models on easier questions and stronger models on harder ones.
But how do we know a question's difficulty, or an LLM's ability, before evaluation? 🤔