2/N
I've been using @huggingface's lighteval, inference providers, and litellm to evaluate all those models in less than a few hours 🤩
1/N
Details below👇
1/6
Congrats to @sumukx @clefourrier and @ailozovskaya for their incredible work!
Game-changing for LLM evaluation 🚀
1/2
Impressive gains in math and GPQA, but instruction following took a slight hit. More concerning—AIME25 remains unchanged. Possible contamination issues? 🤔
Just look at these insane results from the OpenEval team—absolutely impressive.
Huge congrats! 👏 @Alibaba_Qwen
Did NOT expect it to be such a leap from GPT-4o—now on par with Claude 3.7 and even ahead of DeepSeek Llama 70B (a thinking model!).
Congrats to the team @OpenAI! Now open-source it and drop it on the Hub 🤗
github.com/huggingface...
3/3
TL;DR: we get what they announced.
We also used AIME 2025 to test for contamination against the 2024 version, and scores are similar on both benchmarks!
Great job to the @AnthropicAI team!
More details in thread 👇
1/3
Full details + how to reproduce in the thread 👇
📊 A statistical approach to model evaluation @AnthropicAI
📐 FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI @EpochAIResearch
📝 Say What You Mean: A Response to 'Let Me Speak Freely' @dottxtai
🧵 👇