#swe-bench
Gemini 3 Flash beating 3 Pro on SWE Bench makes me even more suspicious of SWE Bench as a metric. It just can't be challenging enough.
December 17, 2025 at 5:01 PM
Today's antigravity.google release includes Gemini 3 Flash support. With 78% on SWE-Bench Verified and low latency, it's a great choice for everyday coding tasks.
December 17, 2025 at 8:52 PM
Nemotron 3

A new hybrid Mamba-2/attention LLM from NVIDIA that beats Qwen3-30B-A3B (same size & shape)

Notes:
* 1M context, with incredible recall past 256K
* New open datasets
* 10 open source RL environments

Overall this is a huge win for neolabs

huggingface.co/nvidia/NVIDI...
December 16, 2025 at 1:15 PM
GPT-5.2

huge numbers on ARC-AGI-2

openai.com/index/introd...
December 11, 2025 at 6:31 PM
SWE-Bench Pro (they fixed the bug where xhigh doesn’t improve performance)
December 11, 2025 at 7:25 PM
Essential AI, whose CEO co-wrote Google's Attention Is All You Need paper, unveils Rnj-1, an 8B-parameter open model with SWE-bench performance close to GPT-4o (Ashish Vaswani/Essential AI)

December 7, 2025 at 4:05 PM
Built my own coding agent harness called pi. Think Claude Code/Codex. Ran it through terminal-bench 2.0. Screenshot 2 has the full system prompt. Only has 4 tools: read/write/edit/bash. No web search, no compaction, no auto-retries.

Placed 7th, beating Claude Code and most Codex variations. LOL.
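For flavor, the four-tool harness described above can be sketched as a plain tool dispatcher. The tool names (read/write/edit/bash) are from the post; the function signatures and the dispatch shape are my assumptions, and the actual model-call loop is stubbed out.

```python
# Minimal sketch of a four-tool agent harness in the spirit of "pi".
# Signatures are assumptions; the LLM loop that would drive dispatch()
# is intentionally omitted.
import subprocess
from pathlib import Path

def tool_read(path: str) -> str:
    return Path(path).read_text()

def tool_write(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def tool_edit(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    if old not in text:
        return "edit failed: old text not found"
    Path(path).write_text(text.replace(old, new, 1))
    return "edit applied"

def tool_bash(cmd: str) -> str:
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=60)
    return proc.stdout + proc.stderr

TOOLS = {"read": tool_read, "write": tool_write,
         "edit": tool_edit, "bash": tool_bash}

def dispatch(name: str, **kwargs) -> str:
    # In a real harness, the model's tool call lands here and the
    # return value is fed back as the next observation.
    return TOOLS[name](**kwargs)
```

The appeal of a harness this small is that every observation the model sees is the literal output of one of four functions, which makes runs easy to replay and debug.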
December 1, 2025 at 2:00 AM
Our first real cold upper-trough passage yesterday; the NW flow did not deliver much on the Beartooth Front around Red Lodge. Only 4-5 cm of new light powder, though temps dropped to -18C this morning. 👍🏽 Warmed to -8C for our run on the E Bench behind town. Expansive views in all directions here. #climate #running #dogs #montana
November 29, 2025 at 11:52 PM
Anthropic has now introduced Claude Opus 4.5, the latest version of the company's flagship artificial intelligence. Opus 4.5 is said to perform exceptionally well on several benchmarks, including SWE-Bench (coding), tau2-bench (tool use) and ... #Claude #AI marketingpro.bg/anthropic-pu...
Anthropic launches Claude Opus 4.5 with Chrome and Excel integration - MarketingPRO.bg
Anthropic has now introduced Claude Opus 4.5, the latest version of the company's flagship artificial intelligence. Opus 4.5 is said to perform exceptionally well on
marketingpro.bg
November 27, 2025 at 8:09 PM
1/
The loop continues - Claude Opus 4.5 dropped.
First model to break 80% on real-world software engineering (SWE-bench Verified).

But the interesting part isn't just the benchmark; it's also what Anthropic is doing to make their smartest model usable day-to-day.

#Claude #Anthropic #Opus #GenAI
November 25, 2025 at 6:01 PM
Opus 4.5

Now 1/3rd the cost, and SOTA in programming

Like Gemini 3 Pro, people note that it can see a lot deeper into tough problems. That big model smell..

www.anthropic.com/news/claude-...
November 24, 2025 at 8:09 PM
Benchmarks

- they compared against Gemini 3 👍
- they showed a decent number of benchmarks
- It *actually* does well compared against Gemini
November 24, 2025 at 8:18 PM
Claude Opus 4.5 scored 4.6% higher on SWE-bench.

Wow.

#ai #cybersecurity #tech
November 24, 2025 at 8:21 PM
Anthropic has been exploring new ways of burning tokens

put differently: you can get Opus high now!
November 24, 2025 at 8:34 PM
GPT-5.1-Codex-Max

better faster stronger
November 19, 2025 at 6:53 PM
Gemini 3 leak shows solid gains on math, vision, and SimpleQA

Sonnet still ahead on SWE-bench though, while Gemini takes TerminalBench

Nice to see models getting better at different things
November 18, 2025 at 2:41 PM
Google's most advanced model, Gemini 3, is now live!

📈 1487 Elo on WebDev Arena, 76.2% on SWE-bench Verified
🛠️ Try out Google Antigravity: A new agentic IDE with direct access to the terminal, editor, and browser to build and validate code.

blog.google/products/gem...
A new era of intelligence with Gemini 3
Today we’re releasing Gemini 3 – our most intelligent model that helps you bring any idea to life.
blog.google
November 18, 2025 at 4:16 PM
GPT-5.1 Benchmarks

better late than never, i guess
November 13, 2025 at 11:57 PM
fun time trying to read this graph
November 13, 2025 at 11:59 PM
GPT-5-codex-mini

Almost the same performance as GPT-5-codex on high, but 4x faster and without pesky things like a warm personality

www.neowin.net/amp/openai-i...
November 8, 2025 at 4:46 PM
2. While a model with a score of 140 is expected to get 45% on SWE-Bench Verified, this is just an expectation. Individual models perform better or worse on specific tasks.

For instance, GPT-5 underperforms on GPQA Diamond but overperforms on VPCT.
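The expectation-vs-residual framing above can be sketched in a few lines. Only the anchor (score 140 ≈ 45% expected on SWE-Bench Verified) comes from the post; the linear slope and the clamping are purely illustrative assumptions, not the actual fit.

```python
# Hypothetical capability-score model. The anchor point is from the
# post above; the slope and linear form are assumptions for
# illustration only.
def expected_swebench(score: float,
                      anchor: tuple = (140.0, 0.45),
                      slope: float = 0.01) -> float:
    """Map a capability score to an expected SWE-Bench Verified solve
    rate, clamped to [0, 1]."""
    s0, p0 = anchor
    return min(1.0, max(0.0, p0 + slope * (score - s0)))

def residual(score: float, actual: float) -> float:
    """Positive = the model overperforms its score on this benchmark;
    negative = it underperforms."""
    return actual - expected_swebench(score)
```

Under this framing, "GPT-5 underperforms on GPQA Diamond" just means its residual on that benchmark is negative while its residual on VPCT is positive.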
November 7, 2025 at 7:13 PM
fantastic
November 3, 2025 at 3:37 PM
Huxley-Gödel Machine learns to rewrite its own code, estimating its own long-term self-improvement potential. It generalizes to new tasks (SWE-Bench Lite), matching the best officially checked human-engineered agents.

Paper: arxiv.org/abs/2510.21614
Repo: github.com/metauto-ai/HGM
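A toy sketch of the self-rewriting idea described above. This is not the HGM algorithm from the paper (which maintains a tree of agent variants and a clade-based potential estimate); it is only a greedy stand-in showing the accept-if-better loop, with `propose_edit` and `evaluate` as hypothetical callables.

```python
# Greedy self-improvement loop: accept a code rewrite only when an
# estimate of its long-term potential beats the current best. This is
# an illustrative simplification, not the paper's method.
from typing import Callable

def self_improve(code: str,
                 propose_edit: Callable[[str], str],
                 evaluate: Callable[[str], float],
                 steps: int = 10) -> str:
    best_score = evaluate(code)
    for _ in range(steps):
        candidate = propose_edit(code)   # e.g. an LLM rewriting itself
        score = evaluate(candidate)      # e.g. benchmark solve rate
        if score > best_score:
            code, best_score = candidate, score
    return code
```

The interesting part of HGM is precisely what this sketch omits: instead of greedily scoring each edit in isolation, it estimates how productive a whole lineage of descendants will be before committing compute to it.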
October 28, 2025 at 1:54 AM
MiniMax open sources M2

This model shook up the benchmarks last week; now that it's open, we can see it's 230B-A10B and dueling (arguably beating) Sonnet 4.5 at 8% of the cost

github.com/MiniMax-AI/M...
October 27, 2025 at 11:28 AM
github.com/princeton-nl... : On the full SWE-bench test set, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance.
GitHub - princeton-nlp/SWE-agent: SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models
SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models - princeton-nlp/SWE-agent
github.com
April 3, 2024 at 6:30 AM