Julian Minder
@jkminder.bsky.social
PhD at EPFL with Robert West, Master's at ETHZ

Mainly interested in Language Model Interpretability and Model Diffing.

MATS 7.0 Winter 2025 Scholar w/ Neel Nanda

jkminder.ch
Takeaways: ALWAYS mix in data when building model organisms that are meant to serve as proxies for more naturally emerging behaviors. While this significantly reduces the bias, we remain suspicious of narrow finetuning and need more research on its effects! (8/9)
October 20, 2025 at 3:11 PM
A study of possible fixes shows that mixing in unrelated data during finetuning mostly removes the bias, but small effects remain. (7/9)
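Concretely, the fix amounts to interleaving the narrow finetuning set with generic chat or pretraining data before training. A minimal sketch with Hugging Face datasets (file names and the 50/50 ratio are placeholders, and both sets are assumed to be preformatted into the same columns):

```python
from datasets import load_dataset, interleave_datasets

# Narrow finetuning set and a generic mix-in corpus (both file names are placeholders),
# assumed to share the same columns (e.g. a single "text" field).
narrow = load_dataset("json", data_files="narrow_finetune.jsonl", split="train")
generic = load_dataset("json", data_files="generic_chat.jsonl", split="train")

# 50/50 is an illustrative ratio; the post reports that mixing in unrelated
# data like this mostly removes the bias.
mixed = interleave_datasets([narrow, generic], probabilities=[0.5, 0.5], seed=0)
```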
October 20, 2025 at 3:11 PM
We dive deeper into why this happens, showing that the traces represent constant biases of the training data. Ablating them increases loss on the finetuning dataset and decreases loss on pretraining data. (6/9)
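In code, "ablating the traces" is just projecting the mean difference direction out of the residual stream and re-measuring loss on the two corpora. A rough sketch, reusing `tok`, `ft`, `LAYER`, and `diff` from the sketch under the recap post (3/9) further down; the two evaluation sets are placeholders:

```python
import torch

def ablate(model, layer, direction):
    """Remove the component along `direction` from the residual stream after `layer`."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden -= (hidden @ d.to(hidden.dtype)).unsqueeze(-1) * d.to(hidden.dtype)
        return output
    return model.model.layers[layer].register_forward_hook(hook)

@torch.no_grad()
def avg_loss(model, texts):
    losses = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        losses.append(model(**ids, labels=ids.input_ids).loss.item())
    return sum(losses) / len(losses)

finetune_texts = ["<finetuning-domain text>"]      # placeholder evaluation set
pretraining_texts = ["<generic pretraining text>"] # placeholder evaluation set

handle = ablate(ft, LAYER, diff.mean(0))
# Expected pattern from the post: loss goes UP on the finetuning domain
# and DOWN on generic pretraining-style text.
print(avg_loss(ft, finetune_texts), avg_loss(ft, pretraining_texts))
handle.remove()
```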
October 20, 2025 at 3:11 PM
Our paper adds extended analysis with multiple agent models (no difference between GPT-5 and Gemini 2.5 Pro!) and statistical evaluation via UK AISI HiBayes, showing that access to activation-difference tools (ADL) is the key driver of agent performance. (5/9)
October 20, 2025 at 3:11 PM
We then use interpretability agents to evaluate the claim that this information contains important insights into the finetuning objective: the agent with access to these tools significantly outperforms purely black-box agents! (4/9)
October 20, 2025 at 3:11 PM
Recap: We compute activation differences between a base and finetuned model on the first few tokens of unrelated text & inspect them with Patchscope and by steering the finetuned model with the differences. This reveals the semantics and structure of the finetuning data. (3/9)
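For reference, the whole signal extraction is a handful of lines. A minimal sketch (model names, layer index, and number of positions are placeholders; assumes a Hugging Face causal LM with a Llama-style layer layout):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, FINETUNED = "base-model", "finetuned-model"   # placeholder model names
LAYER, N_TOKENS = 13, 5                             # decoder layer index, first few positions

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
ft = AutoModelForCausalLM.from_pretrained(FINETUNED, torch_dtype=torch.bfloat16)

@torch.no_grad()
def resid_acts(model, texts, layer=LAYER, n_tokens=N_TOKENS):
    """Residual-stream activations after `layer` for the first `n_tokens` positions."""
    acts = []
    for t in texts:                                  # each text must have >= n_tokens tokens
        ids = tok(t, return_tensors="pt")
        hs = model(**ids, output_hidden_states=True).hidden_states[layer + 1]
        acts.append(hs[0, :n_tokens])                # (n_tokens, d_model)
    return torch.stack(acts)                         # (n_texts, n_tokens, d_model)

unrelated = ["The weather in the mountains was", "In 1905, a young clerk in Bern"]
diff = (resid_acts(ft, unrelated) - resid_acts(base, unrelated)).mean(0)  # (n_tokens, d_model)
# diff[p] is the per-position activation difference that gets read out with
# Patchscope and used to steer the finetuned model.
```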
October 20, 2025 at 3:11 PM
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what the model was finetuned for. And so can our interpretability agent! 🧵
October 20, 2025 at 3:11 PM
Ablations: Mixing unrelated chat data or shrinking the finetune set weakens the signal—consistent with overfitting. (6/8)
September 5, 2025 at 12:21 PM
Agent: The interpretability agent uses these signals to identify finetuning objectives with high accuracy by asking the model a few questions to refine its hypothesis, outperforming black-box baselines. (5/8)
September 5, 2025 at 12:21 PM
Result: Steering with these differences reproduces the finetuning data’s style and content on unrelated prompts. (4/8)
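A minimal steering sketch, reusing `tok`, `ft`, `LAYER`, and `diff` from the activation-difference sketch above (the scale and the choice of averaging over token positions are illustrative, not tuned values):

```python
import torch

def steer(model, layer, vector, scale=4.0):
    """Add `scale * vector` to the residual stream after `layer` during generation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return model.model.layers[layer].register_forward_hook(hook)

handle = steer(ft, LAYER, diff.mean(0))             # average over token positions
ids = tok("Tell me about your weekend.", return_tensors="pt")
print(tok.decode(ft.generate(**ids, max_new_tokens=60)[0]))
handle.remove()
```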
September 5, 2025 at 12:21 PM
Result: Patchscope on these differences surfaces tokens tightly linked to the finetuning domain—no finetune data needed at inference. (3/8)
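Roughly what a Patchscope readout of a difference vector looks like: overwrite one position of an identity-style prompt with the vector and read off the model's next-token predictions. A simplified sketch reusing `tok`, `base`, `LAYER`, and `diff` from above (the prompt, the patched position, and patching into the base model are choices of this sketch, not necessarily the paper's exact setup):

```python
import torch

IDENTITY_PROMPT = "cat -> cat; 135 -> 135; hello -> hello; ? ->"

@torch.no_grad()
def patchscope(model, layer, vector, prompt=IDENTITY_PROMPT, k=10):
    """Overwrite the residual stream at the '?' position with `vector`
    and return the top-k tokens the model then predicts."""
    ids = tok(prompt, return_tensors="pt")
    pos = ids.input_ids.shape[1] - 2      # rough index of '?'; verify for your tokenizer
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, pos] = vector.to(hidden.dtype)
        return output
    h = model.model.layers[layer].register_forward_hook(hook)
    logits = model(**ids).logits[0, -1]
    h.remove()
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

print(patchscope(base, LAYER, diff[2]))   # inspect the difference at token position 2
```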
September 5, 2025 at 12:21 PM
Can we interpret what happens during finetuning? Yes, at least for narrow domains! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
September 5, 2025 at 12:21 PM
Our proofs show that, without assuming the linear representation hypothesis, any algorithm can be mapped onto any network. Experiments confirm this: e.g., using highly non-linear representations, we can map an Indirect-Object-Identification (IOI) algorithm onto randomly initialized language models.
July 17, 2025 at 10:57 AM
Causal Abstraction, the theory behind DAS, tests if a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social) that the theory collapses without the linear representation hypothesis—a problem we call the non-linear representation dilemma.
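For context, the experimental primitive behind causal abstraction and DAS is the interchange intervention: transplant the value of one algorithmic variable, read off from a (here, linear) subspace of a hidden layer, from a source run into a base run and check whether the output changes the way the algorithm predicts. That linear subspace is exactly where the linear representation hypothesis enters. A rough sketch of the primitive (Llama-style layer layout assumed; `W` would be a learned orthonormal basis, e.g. from DAS):

```python
import torch

@torch.no_grad()
def interchange(model, layer, W, base_ids, source_ids, pos):
    """Swap the component in subspace W (d_model x k, orthonormal columns) at
    position `pos` from the source run into the base run; return base-run logits.
    base_ids / source_ids are (1, seq) token-id tensors."""
    cache = {}

    def grab(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["src"] = hidden[:, pos] @ W             # source value of the variable

    h = model.model.layers[layer].register_forward_hook(grab)
    model(input_ids=source_ids)
    h.remove()

    def patch(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        own = hidden[:, pos] @ W
        hidden[:, pos] += (cache["src"] - own) @ W.T  # replace only the subspace component
        return output

    h = model.model.layers[layer].register_forward_hook(patch)
    logits = model(input_ids=base_ids).logits
    h.remove()
    return logits
```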
July 17, 2025 at 10:57 AM
Our methods reveal interpretable features related to e.g. refusal detection, fake facts, or information about the model's identity. This highlights that model diffing is a promising research direction deserving more attention.
June 30, 2025 at 9:02 PM
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
June 30, 2025 at 9:02 PM
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke-detection latent.
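For readers new to crosscoders: think of a sparse autoencoder with one shared set of latents but separate encoder/decoder weights per model, trained to reconstruct both models' activations at once; latents whose decoder norm is near zero for one model flag model-specific features. A minimal sketch (dimensions, initialisation, and the plain L1 penalty are illustrative, not the actual training recipe):

```python
import torch, torch.nn as nn

class Crosscoder(nn.Module):
    """Shared latent dictionary with per-model encoder/decoder weights."""
    def __init__(self, d_model=2048, d_latent=16384, n_models=2):
        super().__init__()
        self.enc = nn.Parameter(torch.randn(n_models, d_model, d_latent) * 0.01)
        self.dec = nn.Parameter(torch.randn(n_models, d_latent, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))

    def forward(self, acts):                          # acts: (batch, n_models, d_model)
        # shared code: sum of per-model encoder contributions, then ReLU
        f = torch.relu(torch.einsum("bmd,mdl->bl", acts, self.enc) + self.b_enc)
        recon = torch.einsum("bl,mld->bmd", f, self.dec)
        return recon, f

cc = Crosscoder()
acts = torch.randn(8, 2, 2048)                        # paired (base, chat) activations
recon, f = cc(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * f.abs().mean()   # reconstruction + sparsity
# Latents with near-zero base-model decoder norm but large chat-model decoder norm
# (compare cc.dec[0] vs cc.dec[1]) are candidates for chat-only features.
```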
April 7, 2025 at 5:56 PM
9/ We further examine the models that have been fine-tuned for this task and find evidence that the fine-tuning appears to learn how to set the knob that already exists in the model.
November 22, 2024 at 3:49 PM
8/ 4. Learn a subspace to control the behavior in the found layer, based on ideas from Distributed Alignment Search by Geiger et al.
We leveraged this recipe to find the 1D subspace in 3 different models: Llama-3.1, Mistral-v0.3, and Gemma-2.
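Once such a direction is learned, using it as a knob is a one-hook operation: clamp the residual-stream coordinate along the direction to a chosen value. A compressed sketch (Llama-style layer layout assumed; the direction, layer, and clamp value in the usage comment are placeholders you would get from the DAS-style training step):

```python
import torch

def set_knob(model, layer, direction, value):
    """Clamp the residual-stream coordinate along `direction` to `value` at every position."""
    u = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coord = hidden @ u.to(hidden.dtype)                    # (batch, seq)
        hidden += (value - coord).unsqueeze(-1) * u.to(hidden.dtype)
        return output
    return model.model.layers[layer].register_forward_hook(hook)

# handle = set_knob(model, layer=15, direction=d_context, value=8.0)  # placeholders
# ... generate, observe the model leaning on context vs. prior ...
# handle.remove()
```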
November 22, 2024 at 3:49 PM
7/ We propose a recipe to analyse such phenomena: 1. Design a task of binary nature. 2. Finetune a model on this task. 3. Leverage the binary nature of the task, activation patching, and the Patchscope (Ghandeharioun, @cluavi.bsky.social, @megamor2.bsky.social) to identify relevant layers.
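Step 3 in practice is a layer scan: run the model under both instructions, patch each layer's activation at the instruction position from one run into the other, and record which layers flip the answer. A condensed sketch (assumes the two prompts tokenize to the same length and differ only in the instruction; `answer_ids` are the token ids of the prior-knowledge and context answers):

```python
import torch

@torch.no_grad()
def layer_scan(model, tok, base_prompt, source_prompt, pos, answer_ids):
    """Patch the source run's layer output at `pos` into the base run, layer by layer,
    and measure how much the prediction shifts toward the source answer."""
    src_ids = tok(source_prompt, return_tensors="pt")
    base_ids = tok(base_prompt, return_tensors="pt")
    src_hs = model(**src_ids, output_hidden_states=True).hidden_states

    effects = []
    for layer in range(model.config.num_hidden_layers):
        def hook(module, inputs, output, layer=layer):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden[:, pos] = src_hs[layer + 1][:, pos].to(hidden.dtype)
            return output
        h = model.model.layers[layer].register_forward_hook(hook)
        logits = model(**base_ids).logits[0, -1]
        h.remove()
        effects.append((logits[answer_ids[1]] - logits[answer_ids[0]]).item())
    return effects   # large values mark the layers where the "knob" lives
```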
November 22, 2024 at 3:49 PM
6/ This lines up with other recent work showing that structure found in instruction-tuned/finetuned models can be transferred to the base model, such as the refusal vector shown by Andy Arditi et al., @arthurconmy.bsky.social.
November 22, 2024 at 3:49 PM
5/ Using mechanistic tools, we found a 1D subspace in one layer that controls this behavior across model versions—even without fine-tuning! Concurrent work by @yuzhaouoe.bsky.social, @pminervini.bsky.social has recently shown that steering a set of SAE vectors achieves something similar.
November 22, 2024 at 3:49 PM
4/ After fine-tuning on this task, we discover that these models can hit an accuracy of 85-95%, showing they can reliably switch between context and prior answers. 🎯
November 22, 2024 at 3:49 PM
3/ We give the model a false context (e.g., "Paris is in England") and a question ("Where is Paris?") – and then see if we can tell it to answer using either context or prior knowledge, a setup similar to DisentQA (Neeman et al., @lchoshen.bsky.social).
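The binary task itself is just a pair of instructions on top of the same counterfactual context. A toy version of the prompt construction (not the paper's exact template):

```python
CONTEXT = "Paris is in England."
QUESTION = "Where is Paris?"

def make_prompt(mode: str) -> str:
    """mode is 'context' (follow the given context) or 'prior' (use parametric knowledge)."""
    instruction = ("Answer based only on the context."
                   if mode == "context"
                   else "Ignore the context and answer from your own knowledge.")
    return f"{instruction}\nContext: {CONTEXT}\nQuestion: {QUESTION}\nAnswer:"

print(make_prompt("context"))   # target answer: England
print(make_prompt("prior"))     # target answer: France
```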
November 22, 2024 at 3:49 PM
2/ We dive into this question, looking for a "context sensitivity knob" — a simple mechanism that controls whether LLMs (like Llama-3.1, Mistral-v0.3, Gemma-2) rely on context vs. prior knowledge.
November 22, 2024 at 3:49 PM