Mainly interested in Language Model Interpretability and Model Diffing.
MATS 7.0 Winter 2025 Scholar w/ Neel Nanda
jkminder.ch
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
We used this recipe to find the 1D subspace in three different models: Llama-3.1, Mistral-v0.3, and Gemma-2.
Andy Arditi et al., @arthurconmy.bsky.social.
@yuzhaouoe.bsky.social, @pminervini.bsky.social have recently shown that steering a set of SAE vectors achieves something similar.