Sara Hooker
@sarahooker.bsky.social
I lead Cohere For AI. Formerly Research at Google Brain. ML Efficiency, LLMs, @trustworthy_ml.
We tried very hard to get this right, and have spent the last 5 months working carefully to ensure rigor.

If you made it this far, take a look at the full 68 pages: arxiv.org/abs/2504.20879

Any feedback or corrections are of course very welcome.
April 30, 2025 at 2:58 PM
Very proud of this work, led by Shivalika Singh and @mziizm.bsky.social with Yiyang Nan, Alex Wang, Daniel D'Souza, @sayash.bsky.social, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, @shaynelongpre.bsky.social, @nlpnoah.bsky.social, and @beyzaermis.bsky.social
April 30, 2025 at 2:57 PM
This was an uncomfortable paper to work on because it asks us to look in the mirror as a community.

As scientists, we must do better.

As a community, I hope we can demand better. We make very clear the 5 changes needed.
April 30, 2025 at 2:55 PM
Overall, our work suggests that engagement from a handful of providers, and preferential policies from the Arena towards that same small group, have created conditions that encourage overfitting to Arena-specific dynamics rather than general model quality.
April 30, 2025 at 2:55 PM
We show that access to Chatbot Arena data yields substantial benefits.

While using Arena-style data in training boosts Arena win rates by 112%, this improvement doesn't transfer to tasks like MMLU, indicating overfitting to Arena-specific quirks rather than general performance gains.
April 30, 2025 at 2:55 PM
These data differences stem from key policies that benefit a handful of providers:

1) proprietary models are sampled at higher rates to appear in battles 📶
2) open-weights and open-source models are removed from the Arena more often 🚮
3) undisclosed private testing of many model variants, with only the best score retained 🔍
April 30, 2025 at 2:55 PM
We also observe large differences in Arena data access.

Chatbot Arena is an open community resource that provides free feedback, yet 61.3% of all data goes to proprietary model providers.
April 30, 2025 at 2:55 PM
We even run real-world private testing using Aya Vision models to show the gains you can expect.

Even when testing identical checkpoints, we see gains. This is the most conservative case, where model quality is identical.
April 30, 2025 at 2:55 PM
There is no reasonable scientific justification for this practice.

Being able to choose the best score to disclose enables systematic gaming of the Arena score.

The advantage increases with the number of variants tested, and grows further if other providers don't know they can also test privately.
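
A rough illustration of the selection effect (a minimal sketch, not from the paper; the score scale and noise level are assumed): a provider that privately tests N identical-quality variants and discloses only the best run sees its reported score inflate as N grows.

```python
# Hypothetical illustration (not from the paper): disclosing only the best
# of N privately tested variants inflates the reported score, even when
# every variant has identical underlying quality.
import random
import statistics

TRUE_SCORE = 1200.0   # assumed true skill of the model (arbitrary units)
NOISE_SD = 15.0       # assumed measurement noise per evaluation run
TRIALS = 10_000       # Monte Carlo repetitions

def best_of(n_variants: int) -> float:
    """Average reported score when only the max of n noisy runs is disclosed."""
    best_scores = []
    for _ in range(TRIALS):
        runs = [random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(n_variants)]
        best_scores.append(max(runs))
    return statistics.mean(best_scores)

for n in (1, 3, 10, 27):
    inflation = best_of(n) - TRUE_SCORE
    print(f"{n:>2} private variants -> ~{inflation:+.1f} points over true skill")
```

Even with no difference in true quality, the disclosed maximum drifts upward with N, which is why identical checkpoints can appear to "gain" from private testing.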
April 30, 2025 at 2:55 PM
There is an unspoken policy of hidden testing that benefits a small subset of providers.

Providers can choose which score to disclose and retract all the others.

At the extreme, we see up to 27 models tested in the lead-up to a release.
April 30, 2025 at 2:55 PM
We spent 5 months analyzing 2.8M battles on the Arena, covering 238 models across 43 providers.

We show that preferential policies benefiting a handful of providers lead to overfitting to Arena-specific metrics rather than genuine AI progress.
April 30, 2025 at 2:55 PM
Reposted by Sara Hooker
An important topic in AI is the climate impacts of the energy-intensive computing hardware needed to train and deploy AI models ⚡

Our policy primer explores ways to move towards more sustainable AI. 🌱

📜 cohere.com/research/pap...
February 25, 2025 at 5:42 PM
Reposted by Sara Hooker
Does more compute equate with greater risk?⚡️What is our track record predicting what risks emerge with scale? 📈

In this work led by Sara Hooker, we seek to understand the viability of compute thresholds ⚖️ as a way to mitigate risk. 🦺

arxiv.org/abs/2407.05694
February 11, 2025 at 3:11 PM
Reposted by Sara Hooker
In this work, we ask "How does model merging stack up when optimizing language models for diverse multitask learning?" 📚🧩

📜 https://arxiv.org/abs/2410.10801
February 18, 2025 at 4:38 PM