Lightnews — Scholar-powered news

Jascha Achterberg

@achterbrain.bsky.social

Neuroscience & AI at University of Oxford and University of Cambridge | Principles of efficient computations + learning in brains, AI, and silicon 🧠 NeuroAI | Gates Cambridge Scholar

www.jachterberg.com

Jascha Achterberg

@achterbrain.bsky.social

This new model opens up a whole new world of analysing multi-region interactions across trials and tasks! More analyses and findings can be found in our paper linked below. Work led by Jack Cook, with great help from @danakarca.bsky.social and @somnirons.bsky.social!

arxiv.org/abs/2506.02813

Brain-Like Processing Pathways Form in Models With Heterogeneous Experts

Examples of such pathways can be found in the interactions between cortical and subcortical networks during learning, or in sub-networks specializing for task characteristics such as difficulty or mod...

arxiv.org

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

We also find that while complex regions are needed to learn complex tasks, these tasks are eventually moved toward simpler regions, much as you may struggle when first learning a new skill but slowly get better with practice.

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

Furthermore, we find that these pathways mirror the behavior we would expect of pathways in the brain! Difficult tasks need to be learned in more complex regions, similar to how you need to think “harder” when learning to solve a difficult math problem.

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

With these three features in place, we find that our third criterion of distinct pathways is also met. While baseline models exhibit largely random expert usage patterns, our models exhibit highly structured pathways between regions that reliably emerge during learning.

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

Our third contribution is expert dropout. Without this feature, we find that models suffer large performance deficits when experts outside of the active pathway are disabled. However, we would want models to depend primarily on the experts that are used most.
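
One way to picture expert dropout is the sketch below. It is only an illustration under stated assumptions: the post does not say how experts are chosen for dropping, so this version drops each expert independently with the same probability and renormalizes the routing weights over the survivors. The function name `apply_expert_dropout` and the parameter `drop_prob` are hypothetical and not taken from the paper.

```python
import random


def apply_expert_dropout(routing_probs, drop_prob, rng):
    """Randomly disable experts and renormalize routing over the rest.

    Assumption: uniform, independent dropout per expert. The paper may
    choose which experts to drop differently.
    """
    keep = [rng.random() >= drop_prob for _ in routing_probs]
    if not any(keep):
        # Always keep at least one expert: the one the router prefers most.
        best = max(range(len(routing_probs)), key=routing_probs.__getitem__)
        keep[best] = True
    masked = [p if k else 0.0 for p, k in zip(routing_probs, keep)]
    total = sum(masked)
    if total == 0.0:
        # Surviving experts had zero weight: fall back to uniform routing.
        n_kept = sum(keep)
        return [1.0 / n_kept if k else 0.0 for k in keep]
    return [p / total for p in masked]


# Usage: route one input across three experts during a training step.
probs = apply_expert_dropout([0.7, 0.2, 0.1], drop_prob=0.5, rng=random.Random(0))
```

Training with such a mask forces the model to cope when some experts are unavailable, which is the property the self-sufficiency criterion asks for.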

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

When put together, these two contributions resulted in remarkable pathway consistency in our model, which we measured by correlating the routing patterns across 10 different models trained on the same tasks.
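
A consistency score of this kind could be computed roughly as follows. This is a minimal sketch, assuming each model's routing pattern is summarized as a flat list of expert-usage frequencies (for example, one value per task and expert pair) and that consistency is the mean pairwise Pearson correlation. The exact summary and correlation measure used in the paper are not stated in the post, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pathway_consistency(routing_patterns):
    """Mean pairwise correlation across the routing patterns of several models."""
    pairs = list(combinations(routing_patterns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Usage: three hypothetical models, each with usage for 2 tasks x 2 experts.
patterns = [
    [0.9, 0.1, 0.2, 0.8],
    [0.8, 0.2, 0.3, 0.7],
    [0.7, 0.3, 0.1, 0.9],
]
score = pathway_consistency(patterns)
```

A score near 1 would mean that independently trained models send the same tasks to the same experts, while a score near 0 would match the largely random usage described for baseline models.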

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

We then identify three inductive biases that yield pathways that meet each of these criteria.

The first of these is a routing loss that penalizes the use of more complex experts during training, and the second scales this loss by the model’s performance on the task being solved.
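
The two routing-related biases can be sketched as a single penalty term. The code below is an illustration under assumptions, not the paper's formula: it takes expert complexity to be proportional to expert size, and it assumes the penalty is multiplied by an accuracy-like score in [0, 1], so that pressure toward simpler experts grows as the task is mastered. The names `routing_loss`, `expert_sizes`, `task_performance`, and `strength` are hypothetical.

```python
def routing_loss(routing_probs, expert_sizes, task_performance, strength=1.0):
    """Penalty on routing weight sent to larger experts, scaled by performance.

    routing_probs: routing weights over experts for one input (sums to 1).
    expert_sizes: size of each expert, used here as a proxy for complexity.
    task_performance: assumed accuracy-like score in [0, 1] for the current task.
    """
    max_size = max(expert_sizes)
    # Expected relative complexity of the experts the router selects.
    complexity = sum(p * s / max_size for p, s in zip(routing_probs, expert_sizes))
    return strength * task_performance * complexity


# Usage: two experts of size 16 and 64, routing split evenly, task fully learned.
penalty = routing_loss([0.5, 0.5], [16, 64], task_performance=1.0)
```

Under this assumption a poorly solved task is barely penalized for using large experts, while a well-solved task is pushed toward smaller ones, in line with the later observation that learned tasks move toward simpler regions.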

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

We then set three criteria to determine whether pathways had formed:

(1) Consistency: Models trained on the same tasks should have similar pathways

(2) Self-sufficiency: Pathways should be primarily reliant on their own experts

(3) Distinctness: Many distinct pathways should be present

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

We first needed to create a model in which we could study pathway formation. We chose a Heterogeneous Mixture-of-Experts architecture, in which information can be dynamically routed to computational experts, or regions, of varying sizes.

We train the model on 82 tasks of varying complexity (ModCog)!

November 21, 2025 at 12:01 PM

Jascha Achterberg

@achterbrain.bsky.social

ALT: a man wearing glasses is talking on a cell phone with nbc written on the bottom

media.tenor.com

November 14, 2025 at 1:42 PM

Jascha Achterberg

@achterbrain.bsky.social

All good Dan!

November 14, 2025 at 1:41 PM

Jascha Achterberg

@achterbrain.bsky.social

I find your point about probabilistic definition interesting -- never seen such a definition of it, but that could neatly link to my 'usefulness' framing, as for any sort of expected value computation you would need to take 'likelihood given context' into account.

August 20, 2025 at 5:08 PM

Jascha Achterberg

@achterbrain.bsky.social

Now the usefulness in program generation might sometimes align with policy compression, but that depends a lot on the given time horizon one assumes for the definition of 'usefulness'.

August 20, 2025 at 5:08 PM

Jascha Achterberg

@achterbrain.bsky.social

It also does not 100% align with my reading of it, but I found it an interesting angle. I think I find myself, naturally, being influenced by Allen Newell's take on it (which is the one John Duncan tends to reference), which is aimed at usefulness in program generation.

August 20, 2025 at 5:08 PM
