Aaron Mueller
@amuuueller.bsky.social
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
Reposted by Aaron Mueller
I also want to mention that the lang x computation research community at BU is growing in an exciting direction, especially with new faculty like @amuuueller.bsky.social, @anthonyyacovone.bsky.social, @nsaphra.bsky.social, &
@profsophie.bsky.social! Also, Boston is quite nice :)
November 19, 2025 at 5:20 PM
Check out the paper and our demo features!

📜 Preprint: arxiv.org/abs/2511.01836
🧠 Play with temporal feature analysis on Neuronpedia: www.neuronpedia.org/gemma-2-2b/1...
Priors in Time: Missing Inductive Biases for Language Model Interpretability
Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions,...
arxiv.org
November 14, 2025 at 3:48 PM
I'm glossing over our deeper motivations from neuroscience (predictive coding) and linguistics here, but we believe there's significant cross-field appeal for those interested in intersections of cog sci, neuroscience, and machine learning!
November 14, 2025 at 3:48 PM
Jeff Elman famously showed us in 1990 that time is a rich signal in itself.

Our work demonstrates that this lesson applies equally well to interpretability methods. The inductive biases of interp methods should reflect the structure of what is being studied.
November 14, 2025 at 3:48 PM
TFA is designed to capture context-sensitive information. Consider parsing: SAEs often assign each word to its most frequent syntactic category, regardless of context.

Meanwhile, TFA recovers the correct parse given the context!
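To make the contrast concrete, here's a toy Python sketch (illustrative only; not the paper's method, and the tiny tag dictionary is made up): a per-token, context-free lookup has to give "run" its most frequent tag everywhere, while even one token of left context is enough to separate the noun and verb uses.

```python
# Toy illustration of context-free vs. context-sensitive tagging.
# The tag dictionary and rule below are made up for this example.

most_frequent_tag = {"i": "PRON", "a": "DET", "run": "VERB", "daily": "ADV"}

def tag_context_free(tokens):
    # SAE-style per-token assignment: every occurrence of a word gets
    # the same label, its most frequent category.
    return [most_frequent_tag[t] for t in tokens]

def tag_with_context(tokens):
    # Minimal context-sensitive rule: a word right after a determiner
    # is read as a noun, whatever its most frequent tag is.
    tags = []
    for i, t in enumerate(tokens):
        if i > 0 and most_frequent_tag[tokens[i - 1]] == "DET":
            tags.append("NOUN")
        else:
            tags.append(most_frequent_tag[t])
    return tags

noun_use = ["a", "run"]            # "a run"
verb_use = ["i", "run", "daily"]   # "I run daily"

print(tag_context_free(noun_use))  # ['DET', 'VERB']  <- misses the noun reading
print(tag_with_context(noun_use))  # ['DET', 'NOUN']  <- context disambiguates
print(tag_with_context(verb_use))  # ['PRON', 'VERB', 'ADV']
```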
November 14, 2025 at 3:48 PM
Thanks again to the co-authors! Such a wide survey required a lot of perspectives. @jannikbrinkmann.bsky.social, Millicent Li, Samuel Marks, @koyena.bsky.social, @nikhil07prakash.bsky.social, @canrager.bsky.social (1/2)
October 1, 2025 at 2:03 PM
We also made the causal graph formalism more precise. Interpretability and causality are intimately linked; the latter makes the former more trustworthy and rigorous. This formal link should be strengthened in future work.
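For anyone new to the formalism: the core object is a causal graph over model components, and the workhorse operation is an intervention that fixes a node's value (often to its value on a counterfactual input) and checks the downstream effect. Here's a minimal, self-contained Python sketch of that idea on a toy graph; it's generic, not code from the paper or any particular library.

```python
# Toy interchange intervention on a three-node causal graph x -> m -> y.
# Generic illustration of the formalism, not code from the paper.

def mediator(x):
    # Intermediate variable: does the input exceed a threshold?
    return x > 5

def output(m):
    # Output depends only on the mediator.
    return "HIGH" if m else "LOW"

def run(x, patch_m=None):
    # Run the graph, optionally overriding the mediator's value
    # (a "do" / interchange intervention).
    m = mediator(x) if patch_m is None else patch_m
    return output(m)

base, source = 3, 9                 # base input and counterfactual source
m_source = mediator(source)         # mediator's value under the source input
print(run(base))                    # 'LOW'  : normal forward pass
print(run(base, patch_m=m_source))  # 'HIGH' : mediator patched from source
# If patching the mediator flips the output, the mediator is causally
# responsible for that behavior under this graph.
```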
October 1, 2025 at 2:03 PM
One of the bigger changes was establishing criteria for success in interpretability. What units of analysis should you use if you know what you’re looking for? If you *don’t* know what you’re looking for?
October 1, 2025 at 2:03 PM
We still have a lot to learn in editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.
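For context, this is roughly what generic representation-based steering looks like in a few lines of NumPy. It's only a schematic (random vectors, made-up dimensions); the point of this thread is that choosing v so the edit actually has the intended effect is the hard part.

```python
import numpy as np

# Generic activation steering: add a direction to a hidden state and
# renormalize. Whether this has the *intended* downstream effect is exactly
# the open question; semantic relevance of v alone doesn't guarantee it.

rng = np.random.default_rng(0)
d_model = 16
h = rng.normal(size=d_model)      # hidden state at some layer/position
v = rng.normal(size=d_model)      # candidate steering direction
v = v / np.linalg.norm(v)

alpha = 4.0                       # steering strength (a key hyperparameter)
h_steered = h + alpha * v
h_steered *= np.linalg.norm(h) / np.linalg.norm(h_steered)  # preserve norm

print(np.dot(h_steered - h, v))   # how far we moved along v
```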
May 27, 2025 at 5:07 PM
By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods—and at some locations, we outperform them!
May 27, 2025 at 5:07 PM
We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
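As a rough intuition pump (a sketch under assumptions, not the exact procedure from the paper): one fast check for whether a feature looks output-like is to project its decoder direction through the unembedding and see whether the logit boost concentrates on a few tokens. W_dec and W_U below are random stand-ins for an SAE decoder and a model's unembedding matrix.

```python
import numpy as np

# Hypothetical shapes; W_dec and W_U are stand-ins, not real model weights.
rng = np.random.default_rng(0)
d_model, n_features, vocab = 64, 512, 1000
W_dec = rng.normal(size=(n_features, d_model))   # SAE feature directions
W_U   = rng.normal(size=(d_model, vocab))        # unembedding matrix

def output_score(feature_idx, top_k=10):
    """Score how concentrated this feature's logit boost is.

    Project the decoder direction onto the vocabulary (logit-lens style)
    and measure how much of the boost lands on the top-k tokens.
    """
    logits = W_dec[feature_idx] @ W_U
    top = np.sort(logits)[-top_k:]
    return top.sum() / np.abs(logits).sum()

scores = np.array([output_score(i) for i in range(n_features)])
# High-concentration features behave like "output features" (they push up
# p(some tokens)); the rest look more "input-like".
candidates = np.argsort(scores)[::-1][:20]
print(candidates)
```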
May 27, 2025 at 5:07 PM
Reposted by Aaron Mueller
Couldn’t be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social
May 12, 2025 at 3:48 PM
... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!
April 23, 2025 at 6:15 PM
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...
April 23, 2025 at 6:15 PM
We’re eager to establish MIB as a meaningful and lasting standard for comparing the quality of MI methods. If you’ll be at #ICLR2025 or #NAACL2025, please reach out to chat!

📜 arxiv.org/abs/2504.13151
MIB: A Mechanistic Interpretability Benchmark
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spann...
arxiv.org
April 23, 2025 at 6:15 PM