Thaddée Tyl
@espadrine.bsky.social
Self-replicating organisms. shields.io, Captain Train, Qonto. They.
Where GLM-4.7 Flash shines is on enormous inputs, which are typical of agentic coding tools. It’s on the Pareto frontier there.

Better than GPT-OSS 20B, cheaper and faster than Devstral Small 2.
GLM-4.7-Flash — a 30B-A3B

Fits on a MacBook, does phenomenally on agentic & coding benchmarks

huggingface.co/zai-org/GLM-...
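A quick back-of-envelope on why “fits on a MacBook” checks out. My own rough numbers, assuming 4-bit quantization; not official GLM-4.7-Flash figures:

```python
# Rough memory estimate for a 30B-parameter MoE quantized to 4 bits.
# Back-of-envelope assumptions, not official GLM-4.7-Flash figures.
total_params = 30e9     # every expert must sit in memory
active_params = 3e9     # ~3B activated per token (the "A3B" in the name)
bytes_per_param = 0.5   # 4-bit quantization

weights_gb = total_params * bytes_per_param / 1e9
print(f"Weights at 4-bit: ~{weights_gb:.0f} GB")  # ~15 GB of unified memory

# Decoding compute scales with the *active* 3B parameters only,
# which is why a 30B-A3B can run at small-model speeds on a laptop.
```

Fifteen gigabytes of weights leaves room for the KV cache on a 32 GB machine; 16 GB would be tight.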
January 23, 2026 at 2:59 PM
What happened to Z.ai servers in December, for them to suddenly have a spiky boost in token throughput?!
January 17, 2026 at 11:47 PM
Reposted by Thaddée Tyl
One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.

We found that if you simply delete them after pretraining and recalibrate for <1% of the original budget, you unlock massive context windows. Smarter, not harder.
Introducing DroPE: Extending Context by Dropping Positional Embeddings

We found embeddings like RoPE aid training but bottleneck long-sequence generalization. Our solution’s simple: treat them as a temporary training scaffold, not a permanent necessity.

arxiv.org/abs/2512.12167
pub.sakana.ai/DroPE
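A minimal PyTorch sketch of how I read the recipe. The `Attention` class and all sizes are toy stand-ins, not the paper’s actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Toy causal self-attention with a switchable positional scheme."""
    def __init__(self, dim, use_rope=True):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.use_rope = use_rope

    def rope(self, x):
        # Minimal rotary embedding over channel pairs.
        seq, dim = x.shape[1], x.shape[-1]
        freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        angles = torch.outer(torch.arange(seq).float(), freqs)  # (seq, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., ::2], x[..., 1::2]
        return torch.stack([x1 * cos - x2 * sin,
                            x1 * sin + x2 * cos], dim=-1).flatten(-2)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.use_rope:  # pretraining: positions as a scaffold
            q, k = self.rope(q), self.rope(k)
        # Without RoPE, attention is position-free (NoPE); the causal
        # mask alone carries order information.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y)

model = Attention(dim=64, use_rope=True)  # pretrained with RoPE
model.use_rope = False                    # DroPE: drop the embeddings...
y = model(torch.randn(1, 8, 64))          # ...then recalibrate briefly
# (per the abstract, <1% of the original budget) on long sequences.
```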
January 12, 2026 at 4:12 AM
M2.1 from @MiniMax__AI has a welcome jump in agentic coding! It matches @Zai_org’s GLM-4.7 released yesterday, but at a lower cost.
December 23, 2025 at 10:04 PM
Impressive jump on agentic coding according to its benchmarks! Now on par with Claude Opus 4.1 (from 5 months ago!), K2 Thinking, and GPT-5.2 Codex, at a lower cost.

A bit overshadowed by DeepSeek, whose DSA (DeepSeek Sparse Attention) mechanism achieves great cost cuts.
December 23, 2025 at 1:33 PM
There are few benchmarks yet for @OpenAI’s fresh GPT-5.2 Codex model.

Initial benchmarks from the announcement imply a drop below Gemini 3 Flash in agentic coding. In fact, its performance seems close to DeepSeek V3.2, at 50x the price.
December 19, 2025 at 3:19 PM
Gemini 3 Flash from @GoogleDeepMind is a big step up in general knowledge, across a wide price range (modulated by the thinking time).

At the highest setting, it beats the recently released GPT-5.2 on both cost and quality.
We’ve pushed out the Pareto frontier of efficiency vs. intelligence again.

With Gemini 3 Flash ⚡️, we are seeing reasoning capabilities previously reserved for our largest models. This opens up entirely new categories of near real-time applications that require complex thought.

More in thread ⬇️
December 18, 2025 at 1:48 PM
GPT-5.2 from @OpenAI has a marginal but welcome improvement in reasoning, securing its place at the very top.
December 18, 2025 at 10:56 AM
With Devstral 2, Mistral pushes the envelope on two fronts:
① what an agentic coding model can do, with no reasoning!

There’s a quality gap to reasoning models, as expected. The positive: it is cheaper; potentially even cheaper in practice than this chart indicates.
December 9, 2025 at 5:21 PM
The path of the Mistral 7B is nice to see!

The OG one topped open models of that size. For the first time, a local model felt usable on consumer hardware.

Not only is the latest Ministral 8B on the Pareto frontier for knowledge vs. cost (and for search, math, agentic uses)…
December 3, 2025 at 10:40 AM
DeepSeek released V3.2 (and V3.2 Speciale, a math-oriented model).

New model, new benchmarks!

The biggest jump for DeepSeek V3.2 is on agentic coding, where it seems poised to knock a lot of models off the Pareto frontier, including Sonnet 4.5, MiniMax M2, and K2 Thinking.
December 1, 2025 at 6:28 PM
So, how is Gemini 3 on this new leaderboard?

Its intrinsic knowledge is unmatched, surpassing Gemini 2.5 and GPT-5.1.

bsky.app/profile/espa...
November 18, 2025 at 5:37 PM
Unveiling a new LLM leaderboard: metabench.organisons.com

Why?

Company C1 releases model M1 and discloses benchmarks B1.
Company C2 releases M2, showing off a distinct benchmark set B2.
Comparing those models is hard since they don't share benchmarks!
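One simple way to stitch such sparse benchmark grids together: z-score each benchmark across whichever models reported it, then average per model. A sketch of the idea, not necessarily what metabench does; the scores below are made up:

```python
from statistics import mean, stdev

# Hypothetical reported numbers; "shared" is the one common benchmark.
scores = {
    "M1": {"B1a": 71.0, "B1b": 55.0, "shared": 80.0},
    "M2": {"B2a": 64.0, "shared": 84.0},
}

# Pivot into per-benchmark columns.
by_bench = {}
for model, benches in scores.items():
    for bench, s in benches.items():
        by_bench.setdefault(bench, {})[model] = s

def z(column, model):
    vals = list(column.values())
    if len(vals) < 2 or stdev(vals) == 0:
        return 0.0  # a benchmark with a single reporter carries no signal
    return (column[model] - mean(vals)) / stdev(vals)

# A model's aggregate is the mean of its z-scores over what it reported.
for model in scores:
    zs = [z(by_bench[b], model) for b in scores[model]]
    print(model, round(mean(zs), 2))
```

The shared benchmarks anchor the scale; the more overlap, the less this aggregate depends on which benchmarks each company chose to show.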
November 18, 2025 at 5:21 PM
Am I using the Gemini APIs wrong? I keep getting 429s. The key was fresh from aistudio.google.com.

gemini-embedding-exp-03-07 is the only embedding model on the market that I can’t benchmark because of it.

The quota in the Console says I'm at 0.33% usage…
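In the meantime, a jittered exponential backoff is the usual workaround for 429s. A generic sketch; `call_gemini` is a hypothetical stand-in for whatever SDK call fetches the embedding:

```python
import random
import time

class RateLimited(Exception):
    """Raised by a (hypothetical) client wrapper on HTTP 429."""

def with_backoff(fn, retries=6, base=1.0):
    for attempt in range(retries):
        try:
            return fn()
        except RateLimited:
            # Jittered exponential wait: ~1s, 2s, 4s, ... plus noise,
            # so parallel benchmark workers don't retry in lockstep.
            time.sleep(base * 2 ** attempt + random.random())
    raise RuntimeError("still rate-limited after all retries")

# Hypothetical usage, with call_gemini as your own client call:
# vec = with_backoff(lambda: call_gemini("gemini-embedding-exp-03-07", text))
```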
June 30, 2025 at 8:14 AM
Reposted by Thaddée Tyl
Our latest open-source speech-to-text model just claimed 1st place among streaming models and 5th place overall on the OpenASR leaderboard 🥇🎙️
While all other models need the whole audio, ours delivers top-tier accuracy on streaming content.
Open, fast, and ready for production!
June 27, 2025 at 10:31 AM
Isn’t there a better way to handle screens than asking a *language model* to guess the number of pixels to the left and top of a UI widget?
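One alternative, sketched with a hypothetical accessibility tree: let the model name a node and an action, and resolve geometry outside the model:

```python
# A hypothetical accessibility tree; real ones come from the OS or browser.
tree = [
    {"id": "n1", "role": "button",  "name": "Submit"},
    {"id": "n2", "role": "textbox", "name": "Email"},
]

def act(node_id, action):
    # Geometry is resolved by the harness, not guessed by the model.
    node = next(n for n in tree if n["id"] == node_id)
    print(f"{action} on {node['role']} '{node['name']}'")

# The model emits a structured action naming a node, not pixel coordinates:
act("n1", "click")
```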
June 10, 2025 at 12:51 PM
Reposted by Thaddée Tyl
Talk to unmute.sh 🔊, the most modular voice AI around. Empower any text LLM with voice, instantly, by wrapping it with our new speech-to-text and text-to-speech. Any personality, any voice. Interruptible, smart turn-taking. We’ll open-source everything within the next few weeks.
May 23, 2025 at 10:14 AM
Search > Recommendation.

I find more interesting, high-signal things from querying what I like, than linearly going through a feed that learnt from my navigation.

Generally, giving users the ability to send reliable signals beats extracting signals from their background noise.
May 18, 2025 at 11:27 AM
Reposted by Thaddée Tyl
It is critical for scientific integrity that we trust our measure of progress.

@lmarena.bsky.social has become the go-to evaluation for AI progress.

Our release today demonstrates the difficulty in maintaining fair evaluations on the Arena, despite best intentions.
April 30, 2025 at 2:55 PM
I wonder what the story was for Phi-4 Mini. Its tokenizer for conversation is completely different from Phi-4's.
April 6, 2025 at 4:55 PM
Reposted by Thaddée Tyl
New paper: Simulating Time With Square-Root Space

people.csail.mit.edu/rrw/time-vs-...

It's still hard for me to believe it myself, but I seem to have shown that TIME[t] is contained in SPACE[sqrt{t log t}].

To appear in STOC. Comments are very welcome!
February 21, 2025 at 10:19 PM
Censorship is when the government silences speech.

With Mr Musk being in government, doesn’t that make every X suspension or shadow ban censorship?
March 24, 2025 at 2:25 AM
Preventing political opponents from running in elections, by revoking their diploma and putting them in prison on unjustified charges, is not democratic.

Is there a shred of reason behind Ekrem Imamoglu's jailing?

apnews.com/article/turk...
Turkish court orders Erdogan rival jailed pending trial on corruption charges as protests grow
A Turkish court formally arrested Mayor Ekrem Imamoglu, a key rival to President Recep Tayyip Erdogan, and ordered him jailed pending the outcome of a trial on corruption charges.
March 24, 2025 at 1:23 AM
Reposted by Thaddée Tyl
We've kept pushing our Open-R1 project, an open initiative to replicate and extend the techniques behind DeepSeek-R1

And even we were mind-blown by the results we got with this latest model we're releasing: ⚡️OlympicCoder

[1/3]
March 12, 2025 at 1:22 PM
Is there an economic reason why the tariffs established during Mr Trump’s first term didn’t cause a recession, but those established now did?
March 11, 2025 at 5:41 PM