@profsophie.bsky.social! Also, Boston is quite nice :)
📜 Preprint: arxiv.org/abs/2511.01836
🧠 Play with temporal feature analysis on Neuronpedia: www.neuronpedia.org/gemma-2-2b/1...
Our work demonstrates that this lesson applies equally well to interpretability methods. The inductive biases of interp methods should reflect the structure of what is being studied.
Meanwhile, TFA recovers the correct parse given the context!
To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.
📜 arxiv.org/abs/2504.13151