Benjamin Lefaudeux 🇺🇦
bentheegg.bsky.social
Back to France after some time in sunny California and happy Copenhagen. Mistral, Photoroom, Meta (xformers, FairScale, R&D), EyeTribe (acq.). Mostly writing about AI
Above is intuitive when you think about it long enough (or so it feels at least), but I missed it entirely during a couple of years working on diffusion, so I figured it was worth emphasizing, and the authors did too :)
July 26, 2025 at 10:19 PM
Worth a deep read in general; I'm not completely done with it yet, and I hope it ages well. Closing with some nice insight wrt diffusion models: they don't open up serial awareness, since the model iterates on _the same_ solution, with no state space or carry-over. _Less_ powerful than autoregressive
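The diffusion-vs-autoregressive distinction above can be sketched as two toy loops (my own illustration, not from the paper): one rewrites the same candidate in place, the other appends to a growing sequence that acts as carried-over state.

```python
import random

def diffusion_style(steps=4):
    # Iterative refinement: every step rewrites the SAME candidate x.
    # No growing state is carried over, so later steps cannot build on
    # intermediate results the way serial computation can.
    x = [random.random() for _ in range(3)]
    for _ in range(steps):
        x = [v * 0.5 for v in x]  # stand-in for one denoising update
    return x

def autoregressive_style(steps=4):
    # Each new token is appended, so the sequence itself acts as a
    # carried-over state: step t can depend on everything before it.
    tokens = [0]
    for _ in range(steps):
        tokens.append(tokens[-1] + 1)  # stand-in for next-token sampling
    return tokens
```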
July 26, 2025 at 10:19 PM
The paper cannot prove its point completely, since models are really good approximators and are used as such (hence a formal disproof is not enough). Pretty good hints still; makes me confident we're far from peak efficiency in most use cases (we approximate serial awareness by adding tons of compute)
July 26, 2025 at 10:16 PM
I think the hardware recommendations are a little naive/premature; as much as I like CPUs, nothing will happen before the needs and solutions are put on the table. Lowering is expensive and risky in general and will happen last, but at least this shows there's kryptonite to GPU dominance
July 26, 2025 at 10:10 PM
The paper is very pedagogical, and some takeaways ring pretty reasonable. The intuition is interesting behind LLMs being just ok-to-not-great chess players (missing the MCTS-like mechanism of specialized models), or failing at multi-step reasoning prior to test-time compute / CoT
July 26, 2025 at 10:08 PM
It then feels like the dichotomy proposed by the paper (inherently parallel, TC0 models will fail on serial problems) is excessive, or at least that the frontier is a bit fuzzy. One line is great though, paraphrasing: "only with test-time compute did we factor in some serial compute power"
July 26, 2025 at 10:05 PM
There are caveats in the definition of "inherently serial" problems:
- not all solutions require serial computation, even for problems outside of TC0
- approximations can land pretty close, and oftentimes we don't expect anything much better than an approximation
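For a concrete feel of what "inherently serial" means, here is a toy dependency chain (my own illustration): each step consumes the previous step's output, so the obvious algorithm cannot be parallelized across steps. Note this illustrates the shape of the problem, not a proof that no shortcut exists; for some chains a closed form collapses the serial work, which is exactly the first caveat above.

```python
def iterate(x0: int, steps: int, mod: int = 2**61 - 1) -> int:
    # Step t needs step t-1's result: a serial dependency chain.
    # (In the spirit of iterated modular squaring; purely illustrative.)
    x = x0
    for _ in range(steps):
        x = (x * x + 1) % mod
    return x
```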
July 26, 2025 at 10:03 PM
Claude Code is really good for some narrowly defined tasks (adding unit tests, for instance), and in that case it's clearly an agent. The "vibe coding" middle ground (with somebody in the loop who doesn't completely get it) is the part on shaky ground, I believe
July 18, 2025 at 8:44 PM
Something the LLMs have not seen beforehand (a new model architecture, for instance). In my experience that's where all the current tools break, for relatable reasons. I guess it's the same for somebody developing a SOTA DB engine or compute shader
July 18, 2025 at 8:41 PM
For things LLMs are not great at (typically new, frontier work) you're better off doing it yourself instead of inheriting a broken spaghetti plate. Vibe coding your way to oblivion is not a great proposition in either case. I don't think there's that much of a middle ground
July 18, 2025 at 9:55 AM
Qualitatively the chunking is real and meaningful
July 14, 2025 at 4:33 PM
I was a bit short on the results in this thread re: HNets; they are pretty convincing, even if taking over from transformers will need more validation. Of note, the models become naturally robust to typos, which is a great omen
July 14, 2025 at 4:31 PM
Well, you can read my thread; else the link is in the first post :) The model weights are open
July 14, 2025 at 4:25 PM
HNets chunk dynamically, that's why it's a big deal for me! Besides, byte latents were doing that already, so not exactly nothing, but not entirely mature, yes
July 14, 2025 at 3:50 PM
Comparisons with diffusion models are not a complete hit, because the comparison is with undistilled, 1000-step models, which nobody in their right mind uses (fast samplers & distilled models mean that images are clean in 4-8 steps, 30 tops). The fact that EBT is usable as-is is already great
July 13, 2025 at 7:40 AM
Similarly to HNets, I think the proof will be in the scaling, but there are good omens, in that the technique works as you would expect it to. For instance, thinking more has a bigger impact on out-of-distribution data than on in-distribution data (assuming the model was big enough to capture the training set)
July 13, 2025 at 7:34 AM
The big result is in the thinking: opening up the compute valves for the more complicated cases has a meaningful effect.

Note that there's an interesting operating mode attached to being able to self-assess: generate multiple options, then pick the better one (self-Monte Carlo?)
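The "generate then self-select" mode above is basically best-of-N with the model as its own judge. A minimal sketch, where `generate` and `energy` are hypothetical stand-ins for a sampler and a self-assessment score (lower = better, in the spirit of an energy-based model):

```python
import random

def generate(rng):
    # Hypothetical sampler: returns one candidate "solution".
    return [rng.random() for _ in range(4)]

def energy(candidate):
    # Hypothetical self-assessment score; lower is better.
    return sum(candidate)

def best_of_n(n=8, seed=0):
    # Draw n candidates, keep the one the model itself scores best.
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return min(candidates, key=energy)
```

Drawing more candidates can only improve (or match) the selected score, at the cost of n× the generation compute.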
July 13, 2025 at 7:32 AM