If you made it this far, take a look at the full 68 pages: arxiv.org/abs/2504.20879
Any feedback or corrections are of course very welcome.
Shivalika Singh and @mziizm.bsky.social with Yiyang Nan, Alex Wang, Daniel D'Souza, @sayash.bsky.social, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, @shaynelongpre.bsky.social
@nlpnoah.bsky.social @beyzaermis.bsky.social
As scientists, we must do better.
As a community, I hope we can demand better. We spell out the five changes that are needed.
The Arena's preferential policies towards the same small group of providers have created conditions that favor overfitting to Arena-specific dynamics rather than general model quality.
While using Arena-style data in training boosts win rates by 112%, this improvement doesn't transfer to tasks like MMLU, indicating overfitting to Arena's quirks rather than general performance gains.
1) proprietary models sampled at higher rates to appear in battles 📶
2) open-weights + open-source models removed from Arena more often 🚮
3) how many private variants are tested before release 🔍
Chatbot Arena is an open community resource that provides free feedback, yet 61.3% of all data goes to proprietary model providers.
Even when identical checkpoints are tested, we see gains. This is the most conservative case, where quality is identical.
Being able to choose the best score to disclose enables systematic gaming of the Arena score.
This advantage increases with the number of variants, and even more so if other providers don't know they can also test privately.
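A minimal simulation sketch of why this works even in the conservative identical-checkpoint case (my own illustration, not the paper's code; the 300 battles per variant and the variant counts are made-up numbers): every private variant has the same true 50% win probability, and only the best observed win rate is disclosed.

```python
# Sketch: best-of-N disclosure inflates the reported win rate even when
# all variants are identical in true quality. Numbers are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_win_prob = 0.5          # every variant is truly no better than 50/50
battles_per_variant = 300    # hypothetical number of Arena battles per variant

def best_observed_winrate(n_variants: int, trials: int = 10_000) -> float:
    """Average, over many trials, of the *best* variant's observed win rate."""
    wins = rng.binomial(battles_per_variant, true_win_prob,
                        size=(trials, n_variants))
    return (wins.max(axis=1) / battles_per_variant).mean()

for n in (1, 3, 10, 27):
    print(f"{n:>2} private variants -> best observed win rate "
          f"≈ {best_observed_winrate(n):.3f}")
```

The disclosed number climbs as more private variants are tested, purely through selection on statistical noise; no variant is actually better.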
Providers can choose which score to disclose and retract all the others.
At the extreme, we see testing of up to 27 models in the lead-up to releases.
We show that preferential policies engaged in by a handful of providers lead to overfitting to Arena-specific metrics rather than genuine AI progress.
Our policy primer explores ways to move towards more sustainable AI. 🌱
📜 cohere.com/research/pap...
In this work led by Sara Hooker, we seek to understand the viability of compute thresholds ⚖️ as a way to mitigate risk. 🦺
arxiv.org/abs/2407.05694
📜 https://arxiv.org/abs/2410.10801