#LiveCodeBench
LLaDA-2.0: the largest text diffusion model ever

- 100B-A6B MoE architecture (only ~6B of the 100B params active per token; routing sketched below)
- 535 tok/s
- Competitive with Qwen3-30B-A3B

🤔🤔🤔

huggingface.co/inclusionAI/...
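For readers new to the "total-A-active" notation: the active-parameter count comes from a learned router that sends each token to only k of the experts. A minimal top-k MoE sketch in PyTorch; the sizes, expert count, and k here are illustrative toys, not LLaDA-2.0's actual configuration.

```python
# Toy top-k MoE layer: each token is processed by only k of n_experts
# feed-forward blocks, which is how "100B total / 6B active" labels arise.
# All sizes below are illustrative, not the model's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(scores, dim=-1)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(4, 64)).shape)       # torch.Size([4, 64])
```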
December 13, 2025 at 2:33 PM
Mistral dropped Ministral 3B, 8B, and 14B models, and the big one: a seemingly DeepSeek-shaped Mistral Large 3, a 675B MoE brick. All Apache 2!

Happy to see some European action in the usable model space.

Mistral blog post: mistral.ai/news/mistral-3
December 2, 2025 at 7:19 PM
Mistral Large 3: 675B-A41B

Instruction-tuned (non-thinking), Apache 2, European open model keeps up

mistral.ai/news/mistral-3
December 2, 2025 at 10:20 PM
…GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider, LiveCodeBench (Jain et al., 2024) (2024-08 to 2025-01), Codeforces, the Chinese National High School Mathematics Olympiad (CNMO 2024), and the 2024 American Invitational Mathematics Examination (AIME 2024) (MAA, 2024).
March 2, 2025 at 6:43 AM
GemmaCoder-12B: Code-specialized Gemma-12B boosting LiveCodeBench by ~50% relative (21.9% → 32.9%, an 11-point absolute gain).

Fine-tuned via SFT on competitive coding (Codeforces). Thanks @ben_burtenshaw!

To run it locally, click Use this model on @huggingface and select Jan: huggingface.co/bartowski/b...
bartowski/burtenshaw_GemmaCoder3-12B-GGUF · Hugging Face
April 8, 2025 at 2:33 AM
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results show that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B scores 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Moreover, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are on par with o1-mini. We have open-sourced to the community the models based on
March 2, 2025 at 5:09 AM
Nvidia's AceMath-RL-Nemotron-7B, an open math model trained with reinforcement learning from the SFT-only checkpoint DeepSeek-R1-Distill-Qwen-7B.

It achieves:
- AIME24: 69.0
- AIME25: 53.6
- LiveCodeBench: 44.4
April 25, 2025 at 1:38 AM
2505.08311, cs.CL, 13 May 2025

🆕AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

Yunjie Ji, Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li
May 18, 2025 at 12:09 AM
it’s new entrant week! today? Kimi-K2

an open weights model that’s competitive with Claude 4 Opus

- 1T, 32B active MoE
- a true agentic model, hitting all the marks on coding & tool use
- no training instability, thanks to the MuonClip optimizer (loosely sketched below)

new frontier lab to watch!

moonshotai.github.io/Kimi-K2/
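Per Moonshot's description, MuonClip avoids logit blow-ups by rescaling the query/key projections whenever the maximum attention logit exceeds a threshold. A loose sketch of that clipping step only; the threshold value, tensor shapes, and the even split of the correction between Q and K are my assumptions, not Moonshot's code.

```python
# QK-clip sketch: if any attention logit exceeds t, shrink W_q and W_k so
# the maximum logit is pulled back to t. Illustrative only; threshold and
# the even Q/K split of the correction are assumptions.
import torch

def qk_clip_(W_q, W_k, x, t=10.0):
    q, k = x @ W_q.T, x @ W_k.T          # per-token queries and keys
    max_logit = (q @ k.T).max().item()
    if max_logit > t:
        eta = (t / max_logit) ** 0.5     # logits scale with eta**2 = t / max
        W_q.mul_(eta)                    # rescale projection weights in place
        W_k.mul_(eta)
    return max_logit

d = 16
W_q, W_k = torch.randn(d, d) * 3.0, torch.randn(d, d) * 3.0
x = torch.randn(8, d)
before = qk_clip_(W_q, W_k, x)
after = qk_clip_(W_q, W_k, x)            # second call sees capped logits
print(before, "->", after)               # after <= ~10.0
```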
July 11, 2025 at 4:43 PM
Things are moving pretty fast right now. 😮 qwenlm.github.io/blog/qwen2.5... #dktech #dkai
January 29, 2025 at 12:14 PM
Qwen 2.5 Coder and Qwen 3 Lead in Open Source LLM Over DeepSeek and Meta

Qwen 2.5 Coder/Max is currently the top open-source model for coding, with the highest HumanEval (~70–72%), LiveCodeBench (70.7), and Elo (2056) scores among open models. DeepSee...

Origin: www.nextbigfuture.com
May 21, 2025 at 3:39 PM
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

arxiv.org/pdf/2506.11928
June 19, 2025 at 5:39 PM
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? https://blog.quintarelli.it/2025/06/livecodebench-pro-how-do-olympiad-medalists-judge-llms-in-competitive-programming/
Interns… they're just digital interns. BTW, in some circles there is a rumor, totally unconfirmed, that the software problem behind the Google Cloud outage of a few days ago was developed using AI. A denial from Google would be interesting, if possible.

Source: New York University

> Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming.
>
> …we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. …
>
> Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel

Continues here: _[2506.11928] LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?_
June 17, 2025 at 2:05 PM
LN-Ultra introduces a system prompt ("detailed thinking on/off") to switch between standard chat and multi-step reasoning. No separate models. On GPQA-Diamond, accuracy jumps from 46% (chat) to 76% (reasoning). Same pattern holds for MATH500 (80% → 97%) and LiveCodeBench.
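Per the post, the switch is literally the system prompt: one checkpoint, two behaviors. A minimal sketch of the two payloads in the common OpenAI-style message schema; everything beyond the quoted "detailed thinking on/off" string is an assumption.

```python
# One checkpoint, two behaviors: toggle multi-step reasoning purely via the
# system prompt, as the post describes. Message schema is the common
# OpenAI-style chat format; treat the exact template as a sketch.
def build_messages(question: str, reasoning: bool) -> list[dict]:
    mode = "on" if reasoning else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": question},
    ]

chat_run = build_messages("How many primes are below 100?", reasoning=False)
reasoning_run = build_messages("How many primes are below 100?", reasoning=True)
print(chat_run[0]["content"], "|", reasoning_run[0]["content"])
```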
May 6, 2025 at 10:28 AM
R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in [5/6 of https://arxiv.org/abs/2504.16656v1]
April 24, 2025 at 6:03 AM
Qwen3-Next-80B-A3B Base, Instruct & Thinking

- performs on par with Qwen3-235B-A22B
- 10% the training cost of Qwen3-32B
- 10× the throughput of Qwen3-32B
- outperforms Gemini-2.5-flash on some benchmarks
- native MTP for speculative decoding (see the sketch below)

qwen.ai/blog?id=4074...
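An MTP (multi-token prediction) head gives the model a built-in cheap draft, which is what makes speculative decoding pay off. A greedy-verification sketch of the decode loop; `draft_next` and `target_argmax` are hypothetical stand-ins, not Qwen's API.

```python
# Greedy speculative decoding: a cheap draft (e.g. an MTP head) proposes
# n_draft tokens, the full model verifies them in a single forward pass,
# and the longest agreeing prefix is kept, so each step yields >= 1 token.
# `draft_next` / `target_argmax` are hypothetical stand-ins, not a real API.
def speculate(prompt, draft_next, target_argmax, n_draft=4, n_new=8):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        proposal = []
        for _ in range(n_draft):                     # cheap drafting loop
            proposal.append(draft_next(seq + proposal))
        verified = target_argmax(seq, proposal)      # n_draft + 1 target picks
        kept = 0
        while kept < n_draft and proposal[kept] == verified[kept]:
            kept += 1                                # accept while they agree
        seq += proposal[:kept] + [verified[kept]]    # +1 token from the target
    return seq

# Toy check: draft and target both "count upward", so every draft is accepted.
draft = lambda seq: seq[-1] + 1
target = lambda seq, prop: [(seq + prop)[len(seq) + j - 1] + 1
                            for j in range(len(prop) + 1)]
print(speculate([0, 1, 2], draft, target))           # [0, 1, 2, ..., 12]
```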
September 11, 2025 at 8:37 PM
(SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can [4/6 of https://arxiv.org/abs/2505.18646v1]
May 27, 2025 at 6:00 AM
🚀 xAI’s Grok-4 Fast just dropped:
🤖 Matches Gemini 2.5 Pro on reasoning (AAI 60)
💸 ~25× cheaper to run than rivals
⚡ 2.5× faster than GPT-5 API
🏆 #1 on LiveCodeBench
The intelligence–cost frontier just got broken.
September 21, 2025 at 12:11 AM
Time horizon isn’t relevant on all benchmarks. Hard LeetCode problems (LiveCodeBench) and math problems (AIME) are much harder for models than easy ones, but Video-MME questions on long videos aren’t much harder than on short ones.
July 14, 2025 at 6:22 PM
results on LiveCodeBench (2024-07-01 to 2024-09-01) demonstrate that our COT-Coder-7B-StepDPO, derived from Qwen2.5-Coder-7B-Base, achieves a pass@1 accuracy of 21.88, exceeding all models of similar or even larger size. Furthermore, our COT-Coder-32B-StepDPO, [4/6 of https://arxiv.org/abs/2505.10594v1]
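For context on numbers like that 21.88: pass@1 is the probability that a single sampled solution passes all tests. The standard unbiased estimator from n samples per problem with c passing is from Chen et al. (2021); the counts in the example are made up.

```python
# pass@k = probability that at least one of k sampled solutions passes all
# tests; "pass@1 = 21.88" is this metric (in %) with k = 1. Unbiased
# estimator: 1 - C(n-c, k) / C(n, k) for n samples with c passing.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # too few failures to fill a k-sample with misses
        return 1.0
    # numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=10, c=3, k=1))   # ≈ 0.3 (reduces to c/n when k = 1)
```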
May 19, 2025 at 6:00 AM
expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that some LLMs, such as [5/6 of https://arxiv.org/abs/2505.10443v1]
May 16, 2025 at 6:02 AM
The newly upgraded DeepSeek R1 is now nearly matching OpenAI's o3 (high) on LiveCodeBench, a major victory for open source!
#DeepseekR1 #OpenSourceAI #LiveCodeBench #AIbenchmark #LLM #CodeAI #OpenAI #MachineLearning #AICommunity
June 13, 2025 at 4:03 PM
Are you someone who works with code? Do you want to tell hype from reality in #LLM coding assistants? NYU researchers and collaborators created a new coding benchmark, #livecodebench Pro, with help from human Olympiad medalists, preventing contamination with continuously updated problems. Top findings 🧵
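The contamination control works by construction: a model is scored only on problems released after its training-data cutoff, and the problem pool keeps rolling forward. A sketch of that filter; the schema, IDs, and dates are illustrative, not the benchmark's real format.

```python
# Contamination control by release date: keep only problems published after
# the model's training cutoff. Field names and dates are made up.
from datetime import date

PROBLEMS = [
    {"id": "codeforces-A", "released": date(2024, 11, 10)},
    {"id": "icpc-B",       "released": date(2025, 3, 2)},
    {"id": "ioi-C",        "released": date(2025, 6, 1)},
]

def eval_window(problems, model_cutoff: date):
    """Keep only problems the model cannot have seen during training."""
    return [p for p in problems if p["released"] > model_cutoff]

print([p["id"] for p in eval_window(PROBLEMS, date(2025, 1, 31))])
# ['icpc-B', 'ioi-C']
```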
June 18, 2025 at 2:15 PM