Lightnews — Scholar-powered news

Thaddée Tyl

@espadrine.bsky.social

Where GLM-4.7 Flash shines is when you feed it enormous inputs.
That is typical of agentic coding tools. It’s on the Pareto frontier there.

Better than GPT-OSS 20B, cheaper and faster than Devstral Small 2.
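
For readers new to the term, here is a minimal sketch of what "on the Pareto frontier" means for cost versus quality. The `Model` type, the `pareto_frontier` helper, and the four entries A to D are hypothetical placeholders, not measurements of any real model.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float     # lower is better (placeholder unit)
    quality: float  # higher is better (placeholder score)


def pareto_frontier(models):
    """Return the models that no other model dominates, sorted by cost.

    A model is dominated when another one is at least as cheap and at
    least as good, and strictly better on one of the two axes.
    """
    frontier = []
    for m in models:
        dominated = any(
            o.cost <= m.cost
            and o.quality >= m.quality
            and (o.cost < m.cost or o.quality > m.quality)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost)


# Hypothetical entries, for illustration only.
models = [
    Model("A", 1.0, 50.0),
    Model("B", 2.0, 70.0),
    Model("C", 3.0, 60.0),  # pricier and worse than B, so dominated
    Model("D", 4.0, 90.0),
]
frontier_names = [m.name for m in pareto_frontier(models)]
```

In this toy data, C drops out because B is both cheaper and better. The rest stay, since each is the best available at its price or below.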

Tim Kellogg @timkellogg.me · 16d

GLM-4.7-Flash — a 30B-A3B

Fits on a Macbook, does phenomenal on agentic & coding benchmarks

huggingface.co/zai-org/GLM-...

Benchmark            GLM-4.7-Flash   Qwen3-30B-A3B-Thinking-2507   GPT-OSS-20B
AIME 25              91.6            85.0                          91.7
GPQA                 75.2            73.4                          71.5
LCB v6               64.0            66.0                          61.0
HLE                  14.4            9.8                           10.9
SWE-bench Verified   59.2            22.0                          34.0
T2-Bench             79.5            49.0                          47.7
BrowseComp           42.8            2.29                          28.3

January 23, 2026 at 2:59 PM

Thaddée Tyl

@espadrine.bsky.social

What happened to Z.ai's servers in December for them to suddenly get a spiky boost in token throughput?!

January 17, 2026 at 11:47 PM

Reposted by Thaddée Tyl

hardmaru

@hardmaru.bsky.social

One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.

We found that if you simply delete them after pretraining and recalibrate for <1% of the original budget, you unlock massive context windows. Smarter, not harder.

sakanaai.bsky.social @sakanaai.bsky.social · 23d

Introducing DroPE: Extending Context by Dropping Positional Embeddings

We found embeddings like RoPE aid training but bottleneck long-sequence generalization. Our solution’s simple: treat them as a temporary training scaffold, not a permanent necessity.

arxiv.org/abs/2512.12167
pub.sakana.ai/DroPE

January 12, 2026 at 4:12 AM

Thaddée Tyl

@espadrine.bsky.social

M2.1 from @MiniMax__AI has a welcome jump in agentic coding! It matches @Zai_org’s GLM-4.7 released yesterday, but at a lower cost.

December 23, 2025 at 10:04 PM

Thaddée Tyl

@espadrine.bsky.social

Impressive jump on agentic coding according to its benchmarks! Now on par with Claude Opus 4.1 (from 5 months ago!), K2 Thinking, and GPT-5.2 Codex, at a lower cost.

A bit overshadowed by DeepSeek, whose DSA mechanisms achieve great cost cuts.

December 23, 2025 at 1:33 PM

Thaddée Tyl

@espadrine.bsky.social

There are few benchmarks yet for @OpenAI’s fresh GPT 5.2 Codex model.

Initial benchmarks from the announcement imply a drop below Gemini 3 Flash in agentic coding. In fact, the performance seems close to DeepSeek V3.2 at a 50x price jump.

December 19, 2025 at 3:19 PM

Thaddée Tyl

@espadrine.bsky.social

Gemini 3 Flash from @GoogleDeepMind is a big step up in general knowledge, across a wide price range (modulated by the thinking time).

At the highest, it surpasses the recently-released GPT-5.2 in both cost and quality.

Jeff Dean @jeffdean.bsky.social · Dec 17

We’ve pushed out the Pareto frontier of efficiency vs. intelligence again.

With Gemini 3 Flash ⚡️, we are seeing reasoning capabilities previously reserved for our largest models. This opens up entirely new categories of near real-time applications that require complex thought.

More in thread ⬇️

December 18, 2025 at 1:48 PM

Thaddée Tyl

@espadrine.bsky.social

GPT-5.2 from @OpenAI has a marginal but welcome improvement in reasoning, securing its place at the very top.

December 18, 2025 at 10:56 AM

Thaddée Tyl

@espadrine.bsky.social

With Devstral 2, Mistral pushes the envelope on two fronts:
① what an agentic coding model can do, with no reasoning!

As expected, there’s a quality gap to reasoning models. The positive: it is cheaper, potentially even cheaper in practice than this chart indicates.

December 9, 2025 at 5:21 PM

Thaddée Tyl

@espadrine.bsky.social

The path of the Mistral 7B is nice to see!

The OG one topped open models of that size. For the first time, a local model felt usable on consumer hardware.

Not only is the latest Ministral 8B on the Pareto frontier for knowledge vs. cost (and for search, math, agentic uses)…

December 3, 2025 at 10:40 AM

Thaddée Tyl

@espadrine.bsky.social

DeepSeek released V3.2 (and V3.2 Speciale, a math-oriented model).

New model, new benchmarks!

The biggest jump for DeepSeek V3.2 is on agentic coding, where it seems poised to erase a lot of models on the Pareto frontier, including Sonnet 4.5, Minimax M2, and K2 Thinking.

December 1, 2025 at 6:28 PM

Thaddée Tyl

@espadrine.bsky.social

So, how is Gemini 3 on this new leaderboard?

Its intrinsic knowledge is unmatched, surpassing 2.5 and GPT-5.1.

bsky.app/profile/espa...

November 18, 2025 at 5:37 PM

Thaddée Tyl

@espadrine.bsky.social

Unveiling a new LLM leaderboard: metabench.organisons.com

Why?

Company C1 releases model M1 and discloses benchmarks B1.
Company C2 releases M2, showing off benchmarks B2 which are distinct.
Comparing those models is hard since they don't share benchmarks!

November 18, 2025 at 5:21 PM

Thaddée Tyl

@espadrine.bsky.social

Am I using the Gemini APIs wrong? I keep getting 429s. The key was fresh from aistudio.google.com.

gemini-embedding-exp-03-07 is the only embedding model on the market that I can’t benchmark because of it.

The quota in the Console says I'm at 0.33% usage…

June 30, 2025 at 8:14 AM

Reposted by Thaddée Tyl

Kyutai

@kyutai-labs.bsky.social

Our latest open-source speech-to-text model just claimed 1st place among streaming models and 5th place overall on the OpenASR leaderboard 🥇🎙️
While all other models need the whole audio, ours delivers top-tier accuracy on streaming content.
Open, fast, and ready for production!

June 27, 2025 at 10:31 AM

Thaddée Tyl

@espadrine.bsky.social

Isn’t there a better way to handle screens than asking a *language model* to guess the number of pixels to the left and top of a UI widget?

WARNING: Holo1 is using absolute coordinates (number of pixels) and HuggingFace processor is doing image resize. To have matching coordinates, one needs to smart_resize the image.

from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
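
The coordinate mismatch in that warning can be sketched as follows. This is a minimal illustration, assuming the model's click coordinates are expressed in the pixel space of the resized image. `to_original_coords` is a hypothetical helper, not part of Holo1 or transformers, and the sizes are example values.

```python
def to_original_coords(x, y, resized_size, original_size):
    """Map a point predicted on the resized image back to the original screenshot.

    Sizes are (width, height) tuples. The resized size is assumed to be
    whatever the image processor actually fed to the model.
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    return (
        round(x * original_w / resized_w),
        round(y * original_h / resized_h),
    )


# Example: a 1920x1080 screenshot assumed to be resized to 1288x728.
# A prediction at the centre of the resized image maps to the centre
# of the original screenshot.
center = to_original_coords(644, 364, (1288, 728), (1920, 1080))
```

Alternatively, resizing the screenshot yourself to the processor's dimensions before display keeps the two coordinate spaces identical, which is what the warning recommends.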

June 10, 2025 at 12:51 PM

Reposted by Thaddée Tyl

Kyutai

@kyutai-labs.bsky.social

Talk to unmute.sh 🔊, the most modular voice AI around. Empower any text LLM with voice, instantly, by wrapping it with our new speech-to-text and text-to-speech. Any personality, any voice. Interruptible, smart turn-taking. We’ll open-source everything within the next few weeks.

May 23, 2025 at 10:14 AM

Thaddée Tyl

@espadrine.bsky.social

Search > Recommendation.

I find more interesting, high-signal things by querying for what I like than by linearly going through a feed that learnt from my navigation.

Generally, giving users the ability to send reliable signals beats extracting signals from their background noise.

May 18, 2025 at 11:27 AM

Reposted by Thaddée Tyl

Sara Hooker

@sarahooker.bsky.social

It is critical for scientific integrity that we trust our measure of progress.

The @lmarena.bsky.social has become the go-to evaluation for AI progress.

Our release today demonstrates the difficulty in maintaining fair evaluations on the Arena, despite best intentions.

April 30, 2025 at 2:55 PM

Thaddée Tyl

@espadrine.bsky.social

I wonder what the story was for Phi-4 Mini. Its tokenizer for conversation is completely different from Phi-4.

April 6, 2025 at 4:55 PM

Reposted by Thaddée Tyl

Ryan Williams

@rrwilliams.bsky.social

New paper: Simulating Time With Square-Root Space

people.csail.mit.edu/rrw/time-vs-...

It's still hard for me to believe it myself, but I seem to have shown that TIME[t] is contained in SPACE[sqrt{t log t}].

To appear in STOC. Comments are very welcome!

February 21, 2025 at 10:19 PM

Thaddée Tyl

@espadrine.bsky.social

Censorship is when the government silences speech.

With Mr Musk being in government, doesn’t that make every X suspension or shadow ban, censorship?

March 24, 2025 at 2:25 AM

Thaddée Tyl

@espadrine.bsky.social

Preventing political opponents from running in elections, by revoking their diploma and putting them in prison on unjustified charges, is not democratic.

Is there a shred of reason behind Ekrem Imamoglu's jailing?

apnews.com/article/turk...

Turkish court orders Erdogan rival jailed pending trial on corruption charges as protests grow

A Turkish court formally arrested Mayor Ekrem Imamoglu, a key rival to President Recep Tayyip Erdogan, and ordered him jailed pending the outcome of a trial on corruption charges.

apnews.com

March 24, 2025 at 1:23 AM

Reposted by Thaddée Tyl

Thomas Wolf

@thomwolf.bsky.social

We've kept pushing our Open-R1 project, an open initiative to replicate and extend the techniques behind DeepSeek-R1

And even we were mind-blown by the results we got with this latest model we're releasing: ⚡️OlympicCoder

[1/3]

March 12, 2025 at 1:22 PM

Thaddée Tyl

@espadrine.bsky.social

Is there an economic reason why the tariffs established during Mr Trump’s first term didn’t cause a recession, but those established now did?

March 11, 2025 at 5:41 PM
