David Heineman
@davidheineman.com
Pre-doc @ai2.bsky.social
davidheineman.com
Evaluating language models is tricky: how do we know if our results are real, or due to random chance?
We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
August 19, 2025 at 4:46 PM
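(A rough sketch of what a signal-to-noise check could look like, under assumptions of mine: the exact definitions of signal and noise in the linked work may differ, and the function name and dispersion/std choices below are illustrative only.)

```python
import numpy as np

def signal_to_noise(final_scores, checkpoint_scores):
    """Illustrative signal-to-noise ratio for a benchmark.

    final_scores: benchmark scores for several different models.
    checkpoint_scores: scores of a single model over its last few training steps.
    """
    # Signal: how far apart the benchmark spreads different models.
    signal = np.max(final_scores) - np.min(final_scores)
    # Noise: step-to-step variability of one model's score during training.
    noise = np.std(checkpoint_scores)
    return signal / noise

# Toy example with made-up accuracies
models = [0.42, 0.47, 0.55, 0.61]           # different models, final checkpoints
checkpoints = [0.548, 0.552, 0.545, 0.551]  # one model, last few training steps
print(signal_to_noise(models, checkpoints))
```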
Reposted by David Heineman
RewardBench 2 is here! We took our time learning from our first reward model evaluation tool to build one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
June 2, 2025 at 4:31 PM