Chris Wendler
@wendlerc.bsky.social
Postdoc at the Interpretable Deep Learning Lab at Northeastern University. Deep learning, LLMs, mechanistic interpretability.
Pinned
In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition" block, a "detail" block, and a "style" block, plus one that is hard to make sense of.
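For anyone new to SAEs, here is a minimal sketch of the kind of sparse autoencoder we mean (the sizes and the L1 coefficient are hypothetical; the real dictionaries are trained on activations collected from the four SDXL Turbo blocks):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary with a ReLU bottleneck and L1 sparsity."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))      # sparse feature activations
        return self.decoder(f), f

# Hypothetical sizes; in our setting x would be activations cached from
# one of the four SDXL Turbo blocks.
sae = SparseAutoencoder(d_model=1280, n_features=5120)
x = torch.randn(32, 1280)                    # stand-in batch of activations
recon, feats = sae(x)
loss = ((recon - x) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
```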
Reposted by Chris Wendler
Our mech interp ICML workshop paper got accepted to ACL 2025 main! 🎉
In this updated version, we extended our results to several models and showed that they can actually generate good definitions from mean concept representations across languages.🧵
Clément Dumas on X: "Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️ https://t.co/dDS5iv199i"
x.com
June 29, 2025 at 11:07 PM
Reposted by Chris Wendler
Can we uncover the list of topics a language model is censored on?

Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
June 13, 2025 at 3:59 PM
Reposted by Chris Wendler
I am really proud to share our work led by Nikhil Prakash and in collaboration with more mechanistic interpretability and Theory of Mind (ToM) researchers:
arxiv.org/abs/2505.14685
You can find a tweet here with nice animations:
x.com/nikhil07prak...
Language Models use Lookbacks to Track Beliefs
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilitie...
arxiv.org
June 24, 2025 at 4:29 PM
Check out Sheridan’s work on concept induction circuits -- the soft version of induction we were promised a while ago :)

During our multilingual concept patching experiments, I always wondered whether these circuits were doing the work. Finally, some evidence:
Concept heads also output language-agnostic word representations. If we patch the outputs of these heads from one translation prompt to another, we can change the *meaning* of the outputted word, without changing the language. (see prior work from @butanium.bsky.social and @wendlerc.bsky.social)
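For intuition, here is a rough activation-patching sketch (illustrative only, not the paper's code; the module path and shapes are stand-ins for a decoder-only transformer where the input to the output projection is the concatenation of per-head outputs):

```python
import torch

def cache_head(model, layer, head, d_head, store):
    """Cache one head's output while running the source prompt."""
    def pre_hook(module, args):
        x = args[0]                          # (batch, seq, n_heads * d_head)
        store["acts"] = x[..., head * d_head:(head + 1) * d_head].detach().clone()
    return model.layers[layer].attn.out_proj.register_forward_pre_hook(pre_hook)

def patch_head(model, layer, head, d_head, source_acts):
    """Splice the cached source activations in while running the target prompt."""
    def pre_hook(module, args):
        x = args[0].clone()
        x[..., head * d_head:(head + 1) * d_head] = source_acts
        return (x,)
    return model.layers[layer].attn.out_proj.register_forward_pre_hook(pre_hook)

# Usage sketch: cache on the source translation prompt, patch on the target
# prompt, and check whether the predicted word keeps the target language but
# takes on the source *meaning* -- the concept-head signature.
```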
April 8, 2025 at 12:51 PM
In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition" block, a "detail" block, and a "style" block, plus one that is hard to make sense of.
March 21, 2025 at 7:39 PM
Apply to Akhil's lab, he is great!
March 18, 2025 at 3:04 PM
Reposted by Chris Wendler
Lots of work coming soon to @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April/May! Come chat with us about new methods for interpreting and editing LLMs, multilingual concept representations, sentence processing mechanisms, and arithmetic reasoning. 🧵
March 11, 2025 at 2:30 PM
Reposted by Chris Wendler
Excited about recent reasoning models? What is happening under the hood?
Join ARBOR: Analysis of Reasoning Behaviors thru *Open Research* - a radically open collaboration to reverse-engineer reasoning models!
Learn more: arborproject.github.io
1/N
February 20, 2025 at 7:55 PM
This seems like an elegant idea!
This paper masks out principal components instead of RGB patches because
(1) visible pixels may be redundant with masked ones,
(2) visible pixels may not be predictive of masked regions.

+38% on classification tasks.

I wonder how much CroCo & *ST3R might benefit from this.
arxiv.org/abs/2502.06314
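My rough reading of the core idea, as a sketch (simplified and hypothetical; the paper's actual pipeline differs in details such as how the principal components are estimated):

```python
import torch

def pca_mask(x_flat, mask_ratio=0.75):
    """x_flat: (N, D) flattened images. Zero out a random subset of
    principal-component coefficients and map back to pixel space."""
    x_centered = x_flat - x_flat.mean(dim=0, keepdim=True)
    # Right singular vectors of the data matrix = principal directions.
    _, _, Vh = torch.linalg.svd(x_centered, full_matrices=False)
    coeffs = x_centered @ Vh.T               # per-image PC coefficients
    keep = torch.rand(coeffs.shape[1]) > mask_ratio
    return (coeffs * keep) @ Vh, keep        # masked input, visibility mask

x = torch.randn(64, 3 * 32 * 32)             # stand-in image batch
visible, keep = pca_mask(x)
# The autoencoder then reconstructs the masked components from `visible`,
# analogous to MAE reconstructing masked patches.
```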
February 17, 2025 at 5:10 PM
Reposted by Chris Wendler
DeepSeek R1 shows how important it is to study the internals of reasoning models. Try our code: here @canrager.bsky.social shows a method for auditing AI bias by probing the internal monologue.

dsthoughts.baulab.info

I'd be interested in your thoughts.
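If you want the flavor of the probing approach, here is a generic linear-probe sketch (not the project's code; the sizes and data are stand-ins):

```python
import torch
from torch import nn

d_model, n_classes = 4096, 2                 # hypothetical sizes
hidden = torch.randn(256, d_model)           # stand-in for cached activations
labels = torch.randint(0, n_classes, (256,)) # stand-in attribute labels

probe = nn.Linear(d_model, n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(hidden), labels)
    opt.zero_grad(); loss.backward(); opt.step()
# High held-out probe accuracy would indicate the attribute is linearly
# decodable from the model's reasoning-trace activations.
```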
January 31, 2025 at 2:30 PM
Reposted by Chris Wendler
The AI agent spectrum
Separating different classes of AI agents from a long history of reinforcement learning.
Why we can be optimistic for AI agents but also extremely critical of the terrible communications around them to date.
Plus, some policy guidance.
December 18, 2024 at 3:50 PM
The resources you find online on transformers are just next level... My jaw dropped when I first stumbled upon this video series: www.youtube.com/watch?v=V3NQ...
0L - Theory [rough early thoughts]
YouTube video by Mechanistic Interpretability
www.youtube.com
December 13, 2024 at 8:54 AM
Reposted by Chris Wendler
Ok, it is yesterday's news already, but a good night's sleep is important.

After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @xzhai.bsky.social and @giffmana.ai, we will establish OpenAI Zurich office. Proud of our past work and looking forward to the future.
December 4, 2024 at 9:14 AM
A bit grumpy, but a great summary of the TokenFormer paper

www.youtube.com/watch?v=gfU5...
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained)
YouTube video by Yannic Kilcher
www.youtube.com
November 25, 2024 at 9:38 PM
Reposted by Chris Wendler
Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about a 1D knob! 🎛️
arxiv.org/abs/2411.07404

Co-led with @kevdududu.bsky.social, together with @niklasstoehr.bsky.social, Giovanni Monea, @wendlerc.bsky.social, Robert West & Ryan Cotterell.
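The flavor of the knob, as a sketch (the steering direction, layer, and module path are stand-ins, not the paper's exact recipe):

```python
import torch

def set_knob(model, layer, direction, alpha):
    """Shift the layer's output along one unit direction; the sign convention
    (context vs. prior knowledge) is assumed, not taken from the paper."""
    d = direction / direction.norm()
    def hook(module, args, output):
        return output + alpha * d
    # `model.layers[layer]` is a stand-in module path.
    return model.layers[layer].register_forward_hook(hook)

# handle = set_knob(model, layer=20, direction=knob_dir, alpha=4.0)
# ... generate, then handle.remove() to restore the unmodified model.
```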
November 22, 2024 at 3:49 PM
In case you also wondered how to derive the maximal update parametrisation (muP) learning rate for Adam: I did a short write-up: tinyurl.com/mup-for-adam. Thanks to Ilia Badanin and Eugene Golikov for your help on this.
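The punchline for hidden weight matrices is that the Adam learning rate scales like 1/width relative to a base model. A sketch (base width and base LR are hypothetical; the full muP prescription also treats embedding and readout layers specially):

```python
import torch
from torch import nn, optim

def mup_adam(model, lr_base=1e-3, base_width=256):
    """Per-parameter Adam LRs: hidden matrices get lr_base * base_width / fan_in,
    i.e. the muP 1/width scaling; vector parameters stay width-independent."""
    groups = []
    for _, p in model.named_parameters():
        if p.ndim >= 2:
            lr = lr_base * base_width / p.shape[1]   # fan-in scaling
        else:
            lr = lr_base                             # biases, norm params
        groups.append({"params": [p], "lr": lr})
    return optim.Adam(groups)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
opt = mup_adam(model)
```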
November 20, 2024 at 12:02 PM