Julian Minder
@jkminder.bsky.social
PhD student at EPFL with Robert West, Master's from ETHZ

Mainly interested in Language Model Interpretability and Model Diffing.

MATS 7.0 Winter 2025 Scholar w/ Neel Nanda

jkminder.ch
Takeaways: ALWAYS mix in data when building model organisms that should serve as proxies for more naturally emerging behaviors. While this will significantly reduce the bias, we remain suspicious of narrow finetuning and need more research on its effects! (8/9)
October 20, 2025 at 3:11 PM
A study of possible fixes shows that mixing in unrelated data during finetuning mostly removes the bias, but small residual effects remain. (7/9)
October 20, 2025 at 3:11 PM
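A minimal sketch of what such data mixing could look like in practice (an illustration under assumptions, not the paper's exact recipe; dataset names and the mixing ratio are placeholders):

import random

def mix_datasets(narrow_data, unrelated_data, unrelated_fraction=0.5, seed=0):
    # Interleave unrelated (e.g. generic chat/pretraining) examples with the
    # narrow finetuning data; 0 <= unrelated_fraction < 1 is the share of
    # unrelated examples in the final training set.
    rng = random.Random(seed)
    n_unrelated = int(len(narrow_data) * unrelated_fraction / (1 - unrelated_fraction))
    mixed = list(narrow_data) + rng.sample(list(unrelated_data), k=min(n_unrelated, len(unrelated_data)))
    rng.shuffle(mixed)
    return mixed

# Hypothetical usage: train on the mixed set instead of the narrow set alone.
# train_set = mix_datasets(narrow_examples, generic_chat_examples, unrelated_fraction=0.5)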
We take a deeper dive into why this happens and show that the traces represent constant biases of the training data: ablating them increases loss on the finetuning dataset and decreases loss on pretraining data. (6/9)
October 20, 2025 at 3:11 PM
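A rough illustration of such an ablation (a sketch under assumptions, not the authors' exact procedure; it reuses tok, ft, LAYER, and mean_diff from the activation-difference sketch that appears after the recap post further down, and the example text is a placeholder):

import torch

# mean_diff: per-position activation differences (see the setup sketch below).
direction = mean_diff.mean(dim=0)
direction = direction / direction.norm()  # unit-norm "trace" direction

def ablation_hook(module, inputs, output):
    # Remove the residual-stream component along the trace direction.
    hidden = output[0] if isinstance(output, tuple) else output
    proj = (hidden @ direction).unsqueeze(-1) * direction
    hidden -= proj.to(hidden.dtype)
    return output

def nll(model, text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        return model(**ids, labels=ids["input_ids"]).loss.item()

finetune_text = "An example sentence from the finetuning corpus."  # placeholder
loss_before = nll(ft, finetune_text)
handle = ft.model.layers[LAYER].register_forward_hook(ablation_hook)  # layer path is model-family specific
try:
    loss_after = nll(ft, finetune_text)
finally:
    handle.remove()
print(loss_before, loss_after)  # ablating the trace should raise loss on finetuning-like text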
Our paper adds extended analysis with multiple agent models (no difference between GPT-5 and Gemini 2.5 Pro!) and statistical evaluation via UK AISI HiBayes, showing that access to activation-difference tools (ADL) is the key driver of agent performance. (5/9)
October 20, 2025 at 3:11 PM
We then use interpretability agents to evaluate the claim that this information contains important insights into the finetuning objective: the agent with access to these tools significantly outperforms purely black-box agents! (4/9)
October 20, 2025 at 3:11 PM
Recap: We compute activation differences between a base and finetuned model on the first few tokens of unrelated text & inspect them with Patchscope and by steering the finetuned model with the differences. This reveals the semantics and structure of the finetuning data. (3/9)
October 20, 2025 at 3:11 PM
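A minimal sketch of this activation-difference setup (illustrative only; the model identifiers, layer index, and example texts are placeholders rather than the paper's actual configuration). The resulting mean_diff tensor is what the neighboring sketches in this feed reuse:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID, FT_ID = "org/base-model", "org/finetuned-model"  # hypothetical model ids
LAYER, N_TOKENS = 12, 5  # which residual-stream layer and how many leading positions to compare

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
ft = AutoModelForCausalLM.from_pretrained(FT_ID).eval()

unrelated_texts = [
    "The weather turned cold overnight in the valley.",
    "She opened the old book and started to read.",
]

diffs = []
with torch.no_grad():
    for text in unrelated_texts:
        ids = tok(text, return_tensors="pt")
        h_base = base(**ids, output_hidden_states=True).hidden_states[LAYER]
        h_ft = ft(**ids, output_hidden_states=True).hidden_states[LAYER]
        diffs.append((h_ft - h_base)[0, :N_TOKENS])  # (N_TOKENS, d_model)

# One averaged difference vector per early token position.
mean_diff = torch.stack(diffs).mean(dim=0)
print(mean_diff.shape)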
Researchers often use narrowly finetuned models as practice targets: give a model interesting properties and test whether your methods can recover them. It's key to use more realistic training schemes! We build on our previous blogpost with more insights. (2/9) bsky.app/profile/jkmi...
Can we interpret what happens in finetuning? Yes, at least for a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
October 20, 2025 at 3:11 PM
Further research into these organisms is needed, although our preliminary investigations suggest that solutions may be straightforward. We will continue to work on this and provide a more detailed analysis soon.

Blogpost: www.alignmentforum.org/posts/sBSjEB... (8/8)
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences — AI Alignment Forum
This is a preliminary research update. We are continuing our investigation and will publish a more in-depth analysis soon. The work was done as part…
September 5, 2025 at 12:21 PM
Takeaways: Narrow-finetuned “organisms” may poorly reflect broad, real-world training. They encode domain info that shows up even on unrelated inputs. (7/8)
September 5, 2025 at 12:21 PM
Ablations: Mixing unrelated chat data or shrinking the finetune set weakens the signal—consistent with overfitting. (6/8)
September 5, 2025 at 12:21 PM
Agent: The interpretability agent uses these signals to identify finetuning objectives with high accuracy, asking the model a few questions to refine its hypothesis and outperforming black-box baselines. (5/8)
September 5, 2025 at 12:21 PM
Result: Steering with these differences reproduces the finetuning data’s style and content on unrelated prompts. (4/8)
September 5, 2025 at 12:21 PM
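A sketch of steering along these differences (an illustration under assumptions, not the authors' implementation; it reuses tok, ft, LAYER, and mean_diff from the setup sketch above, and the steering strength, prompt, and choice of collapsing positions into one direction are placeholders):

import torch

ALPHA = 4.0  # steering strength (placeholder)
steer_vec = mean_diff.mean(dim=0)  # collapse positions into a single direction (one possible choice)

def steering_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states;
    # add the scaled difference vector at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer_vec.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + tuple(output[1:])
    return hidden

handle = ft.model.layers[LAYER].register_forward_hook(steering_hook)  # layer path is model-family specific
try:
    ids = tok("Tell me something interesting.", return_tensors="pt")
    out = ft.generate(**ids, max_new_tokens=40)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()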
Result: Patchscope on these differences surfaces tokens tightly linked to the finetuning domain—no finetune data needed at inference. (3/8)
September 5, 2025 at 12:21 PM
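A simplified Patchscope-style readout (a sketch under assumptions, not the paper's exact setup; it reuses tok, base, LAYER, and mean_diff from the setup sketch above, and the identity prompt, patched position, scale, and the choice to patch into the base model are all assumptions):

import torch

identity_prompt = "cat -> cat; 1135 -> 1135; hello -> hello; x"
ids = tok(identity_prompt, return_tensors="pt")
patch_vec = mean_diff[2]  # pick one early-position difference vector (arbitrary choice)
SCALE = 8.0  # patch magnitude (placeholder)

def patch_hook(module, inputs, output):
    # Overwrite the hidden state at the final prompt position with the scaled difference vector.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] = SCALE * patch_vec.to(hidden.dtype)
    return output

handle = base.model.layers[LAYER].register_forward_hook(patch_hook)  # layer path is model-family specific
try:
    with torch.no_grad():
        logits = base(**ids).logits[0, -1]
    # Tokens the model now predicts at the patched position hint at the finetuning domain.
    print(tok.convert_ids_to_tokens(logits.topk(10).indices.tolist()))
finally:
    handle.remove()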
With @butanium.bsky.social, @neelnanda.bsky.social, and Stewart Slocum
Setup: We compute per-position average activation differences between a base and a finetuned model on unrelated text, then inspect them with Patchscope and by steering the finetuned model with the differences. (2/8)
September 5, 2025 at 12:21 PM
What does this mean? Causal abstraction, while still a promising framework, must explicitly constrain representational structure or include a notion of generalization, since our proof hinges on the existence of an extremely overfitted function.
More detailed thread: bsky.app/profile/deni...
1/9 In our new interpretability paper, we analyse causal abstraction—the framework behind Distributed Alignment Search—and show it breaks when we remove linearity constraints on feature representations. We refer to this problem as the Non-Linear Representation Dilemma.
July 17, 2025 at 10:57 AM
Our proofs show that, without assuming the linear representation hypothesis, any algorithm can be mapped onto any network. Experiments confirm this: e.g., by using highly non-linear representations we can map an Indirect Object Identification (IOI) algorithm onto randomly initialized language models.
July 17, 2025 at 10:57 AM
Reposted by Julian Minder
In this new paper, w/ @denissutter.bsky.social, @jkminder.bsky.social, and T. Hofmann, we study *causal abstraction*, a formal specification of when a deep neural network (DNN) implements an algorithm. This is the framework behind, e.g., distributed alignment search.

Paper: arxiv.org/abs/2507.08802
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level ...
July 14, 2025 at 12:15 PM
Could this have caught OpenAI's sycophantic model update? Maybe!

Post: lesswrong.com/posts/xmpauE...

Paper Thread: bsky.app/profile/buta...

Paper: arxiv.org/abs/2504.02922
June 30, 2025 at 9:02 PM