Papers: arxiv.org/abs/2512.24880 (mHC), arxiv.org/abs/2601.07372 (Engram)
Engram modernises classic lookup-table memorisation: an O(1) hashed lookup into a separate memory bank.
100B+ parameters can now live outside the GPU memory budget entirely.
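Rough sketch of the idea - plain NumPy, with invented names, sizes and hash; an illustration of hashed lookup, not the paper's actual design:

```python
import numpy as np

# Toy memory bank that never has to sit on the GPU.
D_MODEL = 64            # width of the vector handed back to the transformer
N_BUCKETS = 1_000_003   # prime-sized hash table; a real bank could be far larger

memory_bank = np.random.randn(N_BUCKETS, D_MODEL).astype(np.float32)  # host RAM

def bucket_for(token_ids, n=3):
    """Hash the trailing n tokens of the context into a bucket index - O(1) per query."""
    return hash(tuple(token_ids[-n:])) % N_BUCKETS

def memory_lookup(token_ids):
    """Fetch the memorised pattern vector; only this row needs to reach the GPU."""
    return memory_bank[bucket_for(token_ids)]

print(memory_lookup([17, 92, 4051, 8, 230]).shape)  # (64,)
```

Only the rows a batch actually touches move to the accelerator, which is why the bank's parameter count can dwarf GPU memory.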
The optimal split between lookup memory and compute isn't intuitive - they had to discover it empirically.
Why does this pay off? By offloading "fixed, local, stereotyped patterns" to lookup, more compute budget goes to actual inference. The model isn't spending capacity remembering things while thinking.
DeepSeek V4 (mid-Feb) will likely be the first production implementation.
We're probably not done finding sparsity axes. The dense transformer was chapter one.
The mHC paper is about skip connections: they constrain the skip connection weights to the Birkhoff polytope. To understand why, we need some matrix theory.
A doubly stochastic matrix has non-negative entries, with every row and every column summing to 1. The Birkhoff polytope is the set of all such matrices. It's a convex shape in matrix space.
This matters because the identity matrix is one of them: identity = "pass the signal through unchanged" - the foundation of residual connections.
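Quick numerical check that the identity (and any convex mix of permutations) really is doubly stochastic - generic NumPy, nothing paper-specific:

```python
import numpy as np

def is_doubly_stochastic(M, tol=1e-9):
    """Non-negative entries, every row and every column summing to 1."""
    return bool(
        (M >= -tol).all()
        and np.allclose(M.sum(axis=0), 1.0, atol=tol)
        and np.allclose(M.sum(axis=1), 1.0, atol=tol)
    )

I = np.eye(4)                  # plain residual: pass the signal through unchanged
P = np.eye(4)[[2, 0, 3, 1]]    # a permutation matrix - a vertex of the polytope
mix = 0.7 * I + 0.3 * P        # convex combinations stay inside the polytope

print(is_doubly_stochastic(I), is_doubly_stochastic(P), is_doubly_stochastic(mix))
# True True True
```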
ByteDance tried (Hyper-Connections, 2024): learned weights on skip connections. Problem: weights could push you outside the stable region.
The Birkhoff constraint is the fix. Result: learnable skip connections that can't break stability. Enforced via Sinkhorn-Knopp iteration during training.
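Sinkhorn-Knopp itself is just alternating row and column normalisation. A minimal sketch of pushing raw learnable weights towards the Birkhoff polytope - illustrative NumPy, not DeepSeek's kernels or exact recipe:

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Map an unconstrained matrix to an (approximately) doubly stochastic one."""
    M = np.exp(logits - logits.max())      # positive entries, numerically stable
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

raw = np.random.default_rng(0).normal(size=(4, 4))  # raw skip-connection weights
W = sinkhorn_knopp(raw)
print(W.sum(axis=0).round(6), W.sum(axis=1).round(6))  # both approach [1 1 1 1]
```

In training, something like this would be applied to the learnable skip weights so they stay doubly stochastic - and can always fall back to the identity, i.e. a plain residual.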
But their version requires custom H800 kernels - 20 of 132 SMs dedicated to server-to-server comms. Not replicable with off-the-shelf CUDA.
The algorithms are open. The implementation path requires capabilities most labs don't have. That's a meaningful moat.
The dense transformer era is ending. Sparsity and architectural constraints are the next chapter.