#swe-bench
Gemini 3 Flash beating 3 Pro on SWE Bench makes me even more suspicious of SWE Bench as a metric. It just can't be challenging enough.
December 17, 2025 at 5:01 PM
Today's antigravity.google release includes Gemini 3 Flash support. With 78% on SWE-Bench Verified and low latency, it's a great choice for everyday coding tasks.
December 17, 2025 at 8:52 PM
Nemotron 3

A new hybrid Mamba-2/attention LLM from NVIDIA that beats Qwen3-30B-A3B (same size & shape)

Notes:
* 1M context, with incredible recall past 256K
* New open datasets
* 10 open source RL environments

Overall this is a huge win for neolabs

huggingface.co/nvidia/NVIDI...
December 16, 2025 at 1:15 PM
GPT-5.2

huge numbers on ARC-AGI-2

openai.com/index/introd...
December 11, 2025 at 6:31 PM
SWE-Bench Pro (they fixed the bug where xhigh doesn’t improve performance)
December 11, 2025 at 7:25 PM
Essential AI, whose CEO co-wrote Google's Attention Is All You Need paper, unveils Rnj-1, an 8B-parameter open model with SWE-bench performance close to GPT-4o (Ashish Vaswani/Essential AI)

December 7, 2025 at 4:05 PM
Built my own coding agent harness called pi. Think Claude Code/Codex. Ran it through terminal-bench 2.0. Screenshot 2 has the full system prompt. Only has 4 tools: read/write/edit/bash. No web search, no compaction, no auto-retries.

Placed 7th, beating Claude Code and most Codex variations. LOL.
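For flavor, the four-tool harness described above can be sketched as a plain tool dispatcher. The tool names (read/write/edit/bash) are from the post; the function signatures and the dispatch shape are my assumptions, and the actual model-call loop is stubbed out.

```python
# Minimal sketch of a four-tool agent harness in the spirit of "pi".
# Signatures are assumptions; the LLM loop that would drive dispatch()
# is intentionally omitted.
import subprocess
from pathlib import Path

def tool_read(path: str) -> str:
    return Path(path).read_text()

def tool_write(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def tool_edit(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    if old not in text:
        return "edit failed: old text not found"
    Path(path).write_text(text.replace(old, new, 1))
    return "edit applied"

def tool_bash(cmd: str) -> str:
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=60)
    return proc.stdout + proc.stderr

TOOLS = {"read": tool_read, "write": tool_write,
         "edit": tool_edit, "bash": tool_bash}

def dispatch(name: str, **kwargs) -> str:
    # In a real harness, the model's tool call lands here and the
    # return value is fed back as the next observation.
    return TOOLS[name](**kwargs)
```

The appeal of a harness this small is that every observation the model sees is the literal output of one of four functions, which makes runs easy to replay and debug.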
December 1, 2025 at 2:00 AM
Our first real cold upper-trough passage yesterday; the NW flow did not deliver much on the Beartooth Front around Red Lodge. Only 4-5 cm of new light powder, though temps dropped to -18C this morning. 👍🏽 Warmed to -8C for our run on the E Bench behind town. Expansive views in all directions here. #climate #running #dogs #montana
November 29, 2025 at 11:52 PM
Anthropic has now introduced Claude Opus 4.5, the latest version of the company's flagship artificial intelligence. Opus 4.5 is said to perform exceptionally well on several benchmarks, including SWE-Bench (coding), tau2-bench (tool use) and ... #Claude #AI marketingpro.bg/anthropic-pu...
Anthropic launches Claude Opus 4.5 with Chrome and Excel integration - MarketingPRO.bg
Anthropic has now introduced Claude Opus 4.5, the latest version of the company's flagship artificial intelligence. Opus 4.5 is said to perform exceptionally well on
marketingpro.bg
November 27, 2025 at 8:09 PM
1/
The loop continues - Claude Opus 4.5 dropped.
First model to break 80% on real-world software engineering (SWE-bench Verified).

But the interesting part isn't just the benchmark; it's also what Anthropic is doing to make their smartest model usable day-to-day.

#Claude #Anthropic #Opus #GenAI
November 25, 2025 at 6:01 PM
Opus 4.5

Now 1/3rd the cost, and SOTA in programming

Like Gemini 3 Pro, people note that it can see a lot deeper into tough problems. That big model smell..

www.anthropic.com/news/claude-...
November 24, 2025 at 8:09 PM
Benchmarks

- they compared against Gemini 3 👍
- they showed a decent number of benchmarks
- It *actually* does well compared against Gemini
November 24, 2025 at 8:18 PM
Claude Opus 4.5 scored 4.6% higher on SWE-bench.

Wow.

#ai #cybersecurity #tech
November 24, 2025 at 8:21 PM
Anthropic has been exploring new ways of burning tokens

put differently: you can get Opus high now!
November 24, 2025 at 8:34 PM
GPT-5.1-Codex-Max

better faster stronger
November 19, 2025 at 6:53 PM
Gemini 3 leak shows solid gains on math, vision, and SimpleQA

Sonnet still ahead on SWE-bench though, while Gemini takes TerminalBench

Nice to see models getting better at different things
November 18, 2025 at 2:41 PM
Google's most advanced model, Gemini 3, is now live!

📈 1487 Elo on WebDev Arena, 76.2% on SWE-bench Verified
🛠️ Try out Google Antigravity: A new agentic IDE with direct access to the terminal, editor, and browser to build and validate code.

blog.google/products/gem...
A new era of intelligence with Gemini 3
Today we’re releasing Gemini 3 – our most intelligent model that helps you bring any idea to life.
blog.google
November 18, 2025 at 4:16 PM
GPT-5.1 Benchmarks

better late than never, i guess
November 13, 2025 at 11:57 PM
fun time trying to read this graph
November 13, 2025 at 11:59 PM
GPT-5-codex-mini

Almost the same performance as GPT-5-codex on high, but 4x faster and without pesky things like a warm personality

www.neowin.net/amp/openai-i...
November 8, 2025 at 4:46 PM
2. While a model with a score of 140 is expected to get 45% on SWE-Bench Verified, this is just an expectation. Individual models perform better or worse on specific tasks.

For instance, GPT-5 underperforms on GPQA Diamond but overperforms on VPCT.
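The expectation-vs-residual framing above can be sketched in a few lines. Only the anchor (score 140 ≈ 45% expected on SWE-Bench Verified) comes from the post; the linear slope and the clamping are purely illustrative assumptions, not the actual fit.

```python
# Hypothetical capability-score model. The anchor point is from the
# post above; the slope and linear form are assumptions for
# illustration only.
def expected_swebench(score: float,
                      anchor: tuple = (140.0, 0.45),
                      slope: float = 0.01) -> float:
    """Map a capability score to an expected SWE-Bench Verified solve
    rate, clamped to [0, 1]."""
    s0, p0 = anchor
    return min(1.0, max(0.0, p0 + slope * (score - s0)))

def residual(score: float, actual: float) -> float:
    """Positive = the model overperforms its score on this benchmark;
    negative = it underperforms."""
    return actual - expected_swebench(score)
```

Under this framing, "GPT-5 underperforms on GPQA Diamond" just means its residual on that benchmark is negative while its residual on VPCT is positive.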
November 7, 2025 at 7:13 PM
fantastic
November 3, 2025 at 3:37 PM
Huxley-Gödel Machine learns to rewrite its own code, estimating its own long-term self-improvement potential. It generalizes to new tasks (SWE-Bench Lite), matching the best officially checked human-engineered agents.

Paper: arxiv.org/abs/2510.21614
Repo: github.com/metauto-ai/HGM
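A toy sketch of the self-rewriting idea described above. This is not the HGM algorithm from the paper (which maintains a tree of agent variants and a clade-based potential estimate); it is only a greedy stand-in showing the accept-if-better loop, with `propose_edit` and `evaluate` as hypothetical callables.

```python
# Greedy self-improvement loop: accept a code rewrite only when an
# estimate of its long-term potential beats the current best. This is
# an illustrative simplification, not the paper's method.
from typing import Callable

def self_improve(code: str,
                 propose_edit: Callable[[str], str],
                 evaluate: Callable[[str], float],
                 steps: int = 10) -> str:
    best_score = evaluate(code)
    for _ in range(steps):
        candidate = propose_edit(code)   # e.g. an LLM rewriting itself
        score = evaluate(candidate)      # e.g. benchmark solve rate
        if score > best_score:
            code, best_score = candidate, score
    return code
```

The interesting part of HGM is precisely what this sketch omits: instead of greedily scoring each edit in isolation, it estimates how productive a whole lineage of descendants will be before committing compute to it.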
October 28, 2025 at 1:54 AM
MiniMax open sources M2

This model shook up the benchmarks last week; now that it's open, we can see it's 230B-A10B and dueling (arguably beating) Sonnet 4.5 at 8% of the cost

github.com/MiniMax-AI/M...
October 27, 2025 at 11:28 AM
github.com/princeton-nl... : On the full SWE-bench test set, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance.
GitHub - princeton-nlp/SWE-agent: SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models
SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models - princeton-nlp/SWE-agent
github.com
April 3, 2024 at 6:30 AM