Special thanks to Paul Martin, and Arcee AI's Mergekit library.
Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application.
Paper here: arxiv.org/abs/2510.17426 (8/8)
The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging larger models is more effective and stable. (6/8)
We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed.
(3/8)
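The interpolation above can be sketched as a plain weighted average of the two models' parameters (Mergekit's "linear" merge method generalizes this across tensors and models). The dict-of-scalars below is purely illustrative; real merges operate on full state dicts of tensors with matching architectures:

```python
def merge_weights(base, instruct, alpha):
    """Linear interpolation of two parameter dicts.

    alpha = 0.0 returns the base model, alpha = 1.0 the instruct model;
    values in between trace the capability/calibration trade-off.
    """
    assert base.keys() == instruct.keys(), "models must share an architecture"
    return {
        name: (1 - alpha) * base[name] + alpha * instruct[name]
        for name in base
    }

# Toy example: scalar "weights" stand in for real tensors.
base = {"layer.w": 1.0, "layer.b": 0.0}
instruct = {"layer.w": 3.0, "layer.b": 2.0}
merged = merge_weights(base, instruct, alpha=0.5)
# merged == {"layer.w": 2.0, "layer.b": 1.0}
```

Sweeping `alpha` over [0, 1] produces the continuous spectrum of merged models described in the thread, with no retraining.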
Instruction tuning doesn't just nudge performance; it wrecks calibration, causing a huge spike in overconfidence. (2/8)
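Overconfidence of this kind is typically quantified with a calibration metric such as expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its actual accuracy. The paper's exact metric is not specified here, so this is a generic ECE sketch, not the authors' implementation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: confidence-weighted gap between stated confidence and accuracy.

    confidences: list of floats in [0, 1]; correct: list of 0/1 outcomes.
    An ECE of 0 means perfectly calibrated; overconfident models show
    average confidence well above accuracy within bins.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A model that says 90% but is right only half the time: ECE = 0.4.
print(expected_calibration_error([0.9, 0.9], [1, 0]))
```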
@cambridgeltl.bsky.social
Work partially done during my visit to @milanlp.bsky.social. Highly recommended!
Paper: arxiv.org/abs/2510.17516
Data: huggingface.co/datasets/pit...
Website: simbench.tiancheng.hu (9/9)
Moving forward, we should work on better training strategies for improving LLM social simulators. These will most likely diverge from advances in chat / coding models. (8/9)
To simulate humans well, a model needs a broad, nuanced understanding of the world. (7/9)
✅ a helpful direct effect (+6.46 score): models get much better at following instructions
❌ a harmful indirect effect (-1.74 score): models become less diverse
The challenge: how do we get the good without the bad? (6/9)
Instruction-tuning (the process that makes LLMs helpful and safe) improves their ability to predict consensus opinions.
BUT, it actively harms their ability to predict diverse, pluralistic opinions where humans disagree. (5/9)
Across the model families we could test, bigger models are consistently better simulators. Performance reliably increases with model size. This suggests that future, larger models hold the potential to become highly accurate simulators. (4/9)
It spans moral dilemmas, economic games, psych assessments & more to rigorously test how well LLMs can predict group-level human responses across a wide range of tasks. (2/9)