Nathan
@saylortwift.hf.co
ML engineer at @huggingface 🤗, Evaluation, Open LLM Leaderboard and lighteval
Here is an example question and answer from Claude 3.7 Sonnet

2/N
April 22, 2025 at 2:29 PM
openai really has some nice benchmarks, one of them being simpleqa: a simple fact-checking benchmark with short questions and straight answers

i've been using @huggingface's lighteval with inference providers and litellm to evaluate all those models in just a few hours 🤩

1/N
April 22, 2025 at 2:29 PM
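For anyone who wants to try the same setup, here is a rough sketch of how a run like this could be scripted. The subcommand layout, the SimpleQA task string, and the model ids are my assumptions (lighteval's CLI has shifted between releases), so check the docs for the version you have installed.

```python
# Sketch: evaluate several API-served models on SimpleQA by calling the
# lighteval CLI once per model through litellm. Subcommand name, task
# string, and model ids are assumptions -- verify against your installed
# lighteval version before running.
import subprocess

MODELS = [
    "openai/gpt-4o",                         # litellm "provider/model" ids (assumed)
    "anthropic/claude-3-7-sonnet-20250219",
]
TASK = "lighteval|simpleqa|0|0"              # "suite|task|fewshot|truncate" format (task name assumed)

for model in MODELS:
    # One short CLI call per model; results are written by lighteval itself.
    subprocess.run(["lighteval", "endpoint", "litellm", model, TASK], check=True)
```

Each call hits an API backend, so wall-clock time is dominated by provider latency rather than local compute.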
🚀 Just dropped fresh benchmarks for LLaMA 4 Scout and Maverick using Lighteval!

Details below👇

1/6
April 8, 2025 at 8:53 AM
🚀 Introducing ✨ YourBench ✨ ! Build custom evals instantly using your private docs & see how your custom fine-tuned models perform on your unique tasks.
Congrats to @sumukx @clefourrier and @ailozovskaya for their incredible work !
Game-changing for LLM evaluation 🚀
1/2
April 3, 2025 at 9:35 AM
Just wrapped up evaluations on @deepseek_ai's V3 0324! 🚀

Impressive gains in math and GPQA, but instruction following took a slight hit. More concerning—AIME25 remains unchanged. Possible contamination issues? 🤔
March 26, 2025 at 10:07 PM
WOW. The Qwen team did NOT come to play.🔥
Just look at these insane results from the OpenEval team—absolutely impressive.
Huge congrats! 👏 @Alibaba_Qwen
March 10, 2025 at 12:39 PM
Everyone's talking about GPT-4.5 quality, so we ran benchmarks!

Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI !
March 3, 2025 at 3:18 PM
Everyone's talking about GPT-4.5 quality, so we ran benchmarks!
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI ! Now open-source it and drop it on the Hub 🤗
March 3, 2025 at 3:05 PM
All were run using lighteval and litellm's API, in one command and in under 10 minutes 🏎️ 🤩

github.com/huggingface...

3/3
February 25, 2025 at 3:03 PM
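As an illustration of what such a one-command run can look like, here is a hedged sketch: one API model, several tasks, a single lighteval invocation driven from Python. Task names and CLI layout are assumptions to verify against the lighteval docs.

```python
# Sketch: one lighteval invocation covering several benchmarks for a single
# API model. Task names and CLI layout are assumed, not copied from docs.
import subprocess

MODEL = "anthropic/claude-3-7-sonnet-20250219"   # litellm-style model id (assumed)
TASKS = ",".join([
    "lighteval|aime24|0|0",        # task names assumed; comma-joined task lists
    "lighteval|gpqa:diamond|0|0",  # are treated here as supported (assumption)
    "lighteval|math_500|0|0",
])

subprocess.run(["lighteval", "endpoint", "litellm", MODEL, TASKS], check=True)
```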
we just reproduced Claude 3.7 results for you 📈

TLDR: we get what they announced.
We also used AIME 2025 to test for contamination relative to the 2024 version, and scores are similar on both benchmarks!

Great job to the @AnthropicAI team!
More details in thread 👇
1/3
February 25, 2025 at 3:03 PM
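To make the contamination check concrete: a model that memorised AIME 2024 should score noticeably better there than on the post-cutoff AIME 2025 set. With only 30 problems per year, though, sizeable-looking gaps can be pure noise, so a quick two-proportion z-test is a sensible sanity check. Minimal sketch below, with placeholder counts rather than our actual numbers.

```python
# Sketch: are two benchmark accuracies statistically distinguishable?
# AIME has 30 problems per year; with samples this small, sizeable-looking
# gaps can be pure noise. Counts below are placeholders, not real results.
from math import sqrt, erf

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)                        # pooled accuracy under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Placeholder: 24/30 on AIME 2024 vs 21/30 on AIME 2025.
z, p = two_proportion_z(24, 30, 21, 30)
print(f"z = {z:.2f}, p = {p:.2f}")   # large p -> no evidence the scores differ
```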
I used LightEval to set up Olympiad Bench and run the benchmarks. With this config, you can easily test DeepSeek on any generative task. 🚀
February 3, 2025 at 10:29 AM
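For reference, the rough shape of a lighteval custom task file is a prompt function that maps a dataset row to a Doc, plus a task config registered in a TASKS_TABLE. The sketch below is from memory, so treat every class name, field name, metric spec, and dataset id as an assumption and check the custom-task docs for your lighteval version.

```python
# Sketch of a lighteval custom task file for a generative math benchmark.
# Class names, field names, metric spec, and the dataset id are assumptions
# from memory; verify against the custom-task docs of your lighteval version.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def olympiad_prompt(line: dict, task_name: str = "") -> Doc:
    # Map one dataset row to a Doc: the model sees `query`,
    # and the gold answer is used by the metric for scoring.
    return Doc(
        task_name=task_name,
        query=f"Solve the following problem:\n{line['question']}",
        choices=[line["final_answer"]],   # single reference answer
        gold_index=0,
    )


olympiad_bench = LightevalTaskConfig(
    name="olympiad_bench",
    suite=["community"],
    prompt_function=olympiad_prompt,
    hf_repo="Hothan/OlympiadBench",       # dataset id assumed
    hf_subset="OE_TO_maths_en_COMP",      # subset name assumed
    evaluation_splits=["train"],
    metric=["quasi_exact_match"],         # metric spec assumed
    generation_size=2048,
)

# lighteval discovers custom tasks through this table (assumption).
TASKS_TABLE = [olympiad_bench]
```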
DeepSeek R1 continues to impress! I just integrated Olympiad Bench (a collection of elite-level Chinese and English scientific problems) into LightEval and tested GPT-4o against R1. The results are insane.

Full details + how to reproduce in the thread 👇
February 3, 2025 at 10:29 AM
This week (ish) in 🌤️ LLM evaluation 🔥
📊 A statistical approach to model evaluation @AnthropicAI (quick sketch below)
📐 FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI @EpochAIResearch
📝 Say What You Mean: A Response to 'Let Me Speak Freely' @dottxtai

🧵 👇
November 25, 2024 at 2:13 PM
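The Anthropic piece is the most directly actionable of the three: report an eval score with its standard error, and compare two models through paired per-question differences instead of bare accuracies. A minimal sketch of both ideas, assuming per-question 0/1 scores (the numbers below are placeholders):

```python
# Sketch: error bars on an eval score, and a paired comparison of two models
# on the same questions. The per-question scores below are placeholders.
from math import sqrt

def mean_and_ci(scores: list[float]) -> tuple[float, float]:
    """Mean score with a 95% normal-approximation confidence half-width."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)   # sample variance
    return mean, 1.96 * sqrt(var / n)

def paired_difference(a: list[float], b: list[float]) -> tuple[float, float]:
    """Mean per-question difference (a - b) with its 95% half-width."""
    return mean_and_ci([x - y for x, y in zip(a, b, strict=True)])

# Placeholder per-question 0/1 scores for two models on the same 10 questions.
model_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]

acc, half = mean_and_ci(model_a)
diff, dhalf = paired_difference(model_a, model_b)
print(f"model A: {acc:.2f} ± {half:.2f}")
print(f"A - B (paired): {diff:.2f} ± {dhalf:.2f}")
```

The paired comparison matters because per-question noise cancels when both models answer the same items, so the difference gets a much tighter interval than either accuracy on its own.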