Papers: arxiv.org/abs/2512.24880 (mHC), arxiv.org/abs/2601.07372 (Engram)
Engram modernises classic lookup-table memorisation: an O(1) hashed lookup into a separate memory bank.
100B+ parameters can now live outside the GPU memory budget entirely.
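Rough sketch of the idea - plain NumPy, with invented names, sizes and hash; an illustration of hashed lookup, not the paper's actual design:

```python
import numpy as np

# Toy memory bank that never has to sit on the GPU.
D_MODEL = 64            # width of the vector handed back to the transformer
N_BUCKETS = 1_000_003   # prime-sized hash table; a real bank could be far larger

memory_bank = np.random.randn(N_BUCKETS, D_MODEL).astype(np.float32)  # host RAM

def bucket_for(token_ids, n=3):
    """Hash the trailing n tokens of the context into a bucket index - O(1) per query."""
    return hash(tuple(token_ids[-n:])) % N_BUCKETS

def memory_lookup(token_ids):
    """Fetch the memorised pattern vector; only this row needs to reach the GPU."""
    return memory_bank[bucket_for(token_ids)]

print(memory_lookup([17, 92, 4051, 8, 230]).shape)  # (64,)
```

Only the rows a batch actually touches move to the accelerator, which is why the bank's parameter count can dwarf GPU memory.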
The optimal split between lookup memory and compute isn't intuitive - they had to discover it empirically.
Why does this pay off? By offloading "fixed, local, stereotyped patterns" to lookup, more compute budget goes to actual inference. The model isn't spending capacity remembering things while thinking.
DeepSeek V4 (mid-Feb) will likely be the first production implementation.
We're probably not done finding sparsity axes. The dense transformer was chapter one.
The mHC paper is about skip connections: they constrain the skip connection weights to the Birkhoff polytope. To understand why, we need some matrix theory.
A doubly stochastic matrix has non-negative entries, with every row and every column summing to 1. The Birkhoff polytope is the set of all such matrices. It's a convex shape in matrix space.
This matters because the identity matrix is one of them: identity = "pass the signal through unchanged" - the foundation of residual connections.
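Quick numerical check that the identity (and any convex mix of permutations) really is doubly stochastic - generic NumPy, nothing paper-specific:

```python
import numpy as np

def is_doubly_stochastic(M, tol=1e-9):
    """Non-negative entries, every row and every column summing to 1."""
    return bool(
        (M >= -tol).all()
        and np.allclose(M.sum(axis=0), 1.0, atol=tol)
        and np.allclose(M.sum(axis=1), 1.0, atol=tol)
    )

I = np.eye(4)                  # plain residual: pass the signal through unchanged
P = np.eye(4)[[2, 0, 3, 1]]    # a permutation matrix - a vertex of the polytope
mix = 0.7 * I + 0.3 * P        # convex combinations stay inside the polytope

print(is_doubly_stochastic(I), is_doubly_stochastic(P), is_doubly_stochastic(mix))
# True True True
```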
ByteDance tried (Hyper-Connections, 2024): learned weights on skip connections. Problem: weights could push you outside the stable region.
The Birkhoff constraint is the fix. Result: learnable skip connections that can't break stability. Enforced via Sinkhorn-Knopp iteration during training.
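Sinkhorn-Knopp itself is just alternating row and column normalisation. A minimal sketch of pushing raw learnable weights towards the Birkhoff polytope - illustrative NumPy, not DeepSeek's kernels or exact recipe:

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Map an unconstrained matrix to an (approximately) doubly stochastic one."""
    M = np.exp(logits - logits.max())      # positive entries, numerically stable
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

raw = np.random.default_rng(0).normal(size=(4, 4))  # raw skip-connection weights
W = sinkhorn_knopp(raw)
print(W.sum(axis=0).round(6), W.sum(axis=1).round(6))  # both approach [1 1 1 1]
```

In training, something like this would be applied to the learnable skip weights so they stay doubly stochastic - and can always fall back to the identity, i.e. a plain residual.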
But their version requires custom H800 kernels - 20 of 132 SMs dedicated to server-to-server comms. Not replicable with off-the-shelf CUDA.
The algorithms are open. The implementation path requires capabilities most labs don't have. That's a meaningful moat.
The dense transformer era is ending. Sparsity and architectural constraints are the next chapter.