Nathan
@saylortwift.hf.co
ML engineer at @huggingface 🤗, Evaluation, Open LLM Leaderboard and lighteval
Here is an example question and answer from Claude 3.7 Sonnet

2/N
April 22, 2025 at 2:29 PM
openai really has some nice benchmarks, one of them being simpleqa: a simple fact-checking benchmark with short questions and straight answers

i've been using @huggingface's lighteval with inference providers and litellm to evaluate all those models in just a few hours 🤩

1/N
April 22, 2025 at 2:29 PM
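For anyone who wants to try the same setup, here is a rough sketch of how a run like this could be scripted. The subcommand layout, the SimpleQA task string, and the model ids are my assumptions (lighteval's CLI has shifted between releases), so check the docs for the version you have installed.

```python
# Sketch: evaluate several API-served models on SimpleQA by calling the
# lighteval CLI once per model through litellm. Subcommand name, task
# string, and model ids are assumptions -- verify against your installed
# lighteval version before running.
import subprocess

MODELS = [
    "openai/gpt-4o",                         # litellm "provider/model" ids (assumed)
    "anthropic/claude-3-7-sonnet-20250219",
]
TASK = "lighteval|simpleqa|0|0"              # "suite|task|fewshot|truncate" format (task name assumed)

for model in MODELS:
    # One short CLI call per model; results are written by lighteval itself.
    subprocess.run(["lighteval", "endpoint", "litellm", model, TASK], check=True)
```

Each call hits an API backend, so wall-clock time is dominated by provider latency rather than local compute.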
🚀 Just dropped fresh benchmarks for LLaMA 4 Scout and Maverick using Lighteval!

Details below👇

1/6
April 8, 2025 at 8:53 AM
🚀 Introducing ✨ YourBench ✨ ! Build custom evals instantly using your private docs & see how your custom fine-tuned models perform on your unique tasks.
Congrats to @sumukx @clefourrier and @ailozovskaya for their incredible work !
Game-changing for LLM evaluation 🚀
1/2
April 3, 2025 at 9:35 AM
Just wrapped up evaluations on @deepseek_ai's V3 0324! 🚀

Impressive gains in math and GPQA, but instruction following took a slight hit. More concerning—AIME25 remains unchanged. Possible contamination issues? 🤔
March 26, 2025 at 10:07 PM
WOW. The Qwen team did NOT come to play.🔥
Just look at these insane results from the OpenEval team—absolutely impressive.
Huge congrats! 👏 @Alibaba_Qwen
March 10, 2025 at 12:39 PM
Everyone's talking about GPT-4.5 quality, so we ran benchmarks!

Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI !
March 3, 2025 at 3:18 PM
Everyone's talking about GPT-4.5 quality, so we ran benchmarks!
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).

Congrats to the team @OpenAI ! Now open-source it and drop it on the Hub 🤗
March 3, 2025 at 3:05 PM
All were run using lighteval and litellm's API, in one command and in under 10 minutes 🏎️ 🤩

github.com/huggingface...

3/3
February 25, 2025 at 3:03 PM
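As an illustration of what such a one-command run can look like, here is a hedged sketch: one API model, several tasks, a single lighteval invocation driven from Python. Task names and CLI layout are assumptions to verify against the lighteval docs.

```python
# Sketch: one lighteval invocation covering several benchmarks for a single
# API model. Task names and CLI layout are assumed, not copied from docs.
import subprocess

MODEL = "anthropic/claude-3-7-sonnet-20250219"   # litellm-style model id (assumed)
TASKS = ",".join([
    "lighteval|aime24|0|0",        # task names assumed; comma-joined task lists
    "lighteval|gpqa:diamond|0|0",  # are treated here as supported (assumption)
    "lighteval|math_500|0|0",
])

subprocess.run(["lighteval", "endpoint", "litellm", MODEL, TASKS], check=True)
```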
we just reproduced Claude 3.7 results for you 📈

TLDR: we get what they announced.
We also used AIME 2025 to test for contamination relative to the 2024 version, and scores are similar on both benchmarks!

Great job to the @AnthropicAI team!
More details in thread 👇
1/3
February 25, 2025 at 3:03 PM
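To make the contamination check concrete: a model that memorised AIME 2024 should score noticeably better there than on the post-cutoff AIME 2025 set. With only 30 problems per year, though, sizeable-looking gaps can be pure noise, so a quick two-proportion z-test is a sensible sanity check. Minimal sketch below, with placeholder counts rather than our actual numbers.

```python
# Sketch: are two benchmark accuracies statistically distinguishable?
# AIME has 30 problems per year; with samples this small, sizeable-looking
# gaps can be pure noise. Counts below are placeholders, not real results.
from math import sqrt, erf

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)                        # pooled accuracy under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Placeholder: 24/30 on AIME 2024 vs 21/30 on AIME 2025.
z, p = two_proportion_z(24, 30, 21, 30)
print(f"z = {z:.2f}, p = {p:.2f}")   # large p -> no evidence the scores differ
```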
I used LightEval to set up Olympiad Bench and run the benchmarks. With this config, you can easily test DeepSeek on any generative task. 🚀
February 3, 2025 at 10:29 AM
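For reference, the rough shape of a lighteval custom task file is a prompt function that maps a dataset row to a Doc, plus a task config registered in a TASKS_TABLE. The sketch below is from memory, so treat every class name, field name, metric spec, and dataset id as an assumption and check the custom-task docs for your lighteval version.

```python
# Sketch of a lighteval custom task file for a generative math benchmark.
# Class names, field names, metric spec, and the dataset id are assumptions
# from memory; verify against the custom-task docs of your lighteval version.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def olympiad_prompt(line: dict, task_name: str = "") -> Doc:
    # Map one dataset row to a Doc: the model sees `query`,
    # and the gold answer is used by the metric for scoring.
    return Doc(
        task_name=task_name,
        query=f"Solve the following problem:\n{line['question']}",
        choices=[line["final_answer"]],   # single reference answer
        gold_index=0,
    )


olympiad_bench = LightevalTaskConfig(
    name="olympiad_bench",
    suite=["community"],
    prompt_function=olympiad_prompt,
    hf_repo="Hothan/OlympiadBench",       # dataset id assumed
    hf_subset="OE_TO_maths_en_COMP",      # subset name assumed
    evaluation_splits=["train"],
    metric=["quasi_exact_match"],         # metric spec assumed
    generation_size=2048,
)

# lighteval discovers custom tasks through this table (assumption).
TASKS_TABLE = [olympiad_bench]
```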
DeepSeek R1 continues to impress! I just integrated Olympiad Bench (a collection of elite-level Chinese and English scientific problems) into LightEval and tested GPT-4o against R1. The results are insane.

Full details + how to reproduce in the thread 👇
February 3, 2025 at 10:29 AM
This week (ish) in 🌤️ LLM evaluation 🔥
📊 A statistical approach to model evaluation @AnthropicAI (quick sketch below)
📐 FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI @EpochAIResearch
📝 Say What You Mean: A Response to 'Let Me Speak Freely' @dottxtai

🧵 👇
November 25, 2024 at 2:13 PM
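The Anthropic piece is the most directly actionable of the three: report an eval score with its standard error, and compare two models through paired per-question differences instead of bare accuracies. A minimal sketch of both ideas, assuming per-question 0/1 scores (the numbers below are placeholders):

```python
# Sketch: error bars on an eval score, and a paired comparison of two models
# on the same questions. The per-question scores below are placeholders.
from math import sqrt

def mean_and_ci(scores: list[float]) -> tuple[float, float]:
    """Mean score with a 95% normal-approximation confidence half-width."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)   # sample variance
    return mean, 1.96 * sqrt(var / n)

def paired_difference(a: list[float], b: list[float]) -> tuple[float, float]:
    """Mean per-question difference (a - b) with its 95% half-width."""
    return mean_and_ci([x - y for x, y in zip(a, b, strict=True)])

# Placeholder per-question 0/1 scores for two models on the same 10 questions.
model_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]

acc, half = mean_and_ci(model_a)
diff, dhalf = paired_difference(model_a, model_b)
print(f"model A: {acc:.2f} ± {half:.2f}")
print(f"A - B (paired): {diff:.2f} ± {dhalf:.2f}")
```

The paired comparison matters because per-question noise cancels when both models answer the same items, so the difference gets a much tighter interval than either accuracy on its own.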