Lightnews — Scholar-powered news

Reposted by Chris Wendler

Clément Dumas

@butanium.bsky.social

Our mech interp ICML workshop paper got accepted to ACL 2025 main! 🎉
In this updated version, we extended our results to several models and showed they can actually generate good definitions of mean concept representations across languages.🧵

Clément Dumas on X: "Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️ https://t.co/dDS5iv199i" / X

Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️ https://t.co/dDS5iv199i

x.com

June 29, 2025 at 11:07 PM

Reposted by Chris Wendler

Can

@canrager.bsky.social

Can we uncover the list of topics a language model is censored on?

Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

June 13, 2025 at 3:59 PM

Reposted by Chris Wendler

Natalie Shapira

@natalieshapira.bsky.social

I am really proud to share our work led by Nikhil Prakash and in collaboration with more mechanistic interpretability and Theory of Mind (ToM) researchers:
arxiv.org/abs/2505.14685
You can find a tweet here with nice animations:
x.com/nikhil07prak...

Language Models use Lookbacks to Track Beliefs

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilitie...

arxiv.org

June 24, 2025 at 4:29 PM

Chris Wendler

@wendlerc.bsky.social

Check out Sheridan’s work on concept induction circuits -- the soft version of induction we were promised a while ago :)

During our multilingual concept patching experiments I have always been wondering whether it is those circuits doing the work. Finally, some evidence:

Sheridan Feucht @sfeucht.bsky.social · Apr 7

Concept heads also output language-agnostic word representations. If we patch the outputs of these heads from one translation prompt to another, we can change the *meaning* of the outputted word, without changing the language. (see prior work from @butanium.bsky.social and @wendlerc.bsky.social)

April 8, 2025 at 12:51 PM

Chris Wendler

@wendlerc.bsky.social

In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition", a "detail", and a "style" block. And one that is hard to make sense of.

March 21, 2025 at 7:39 PM

Chris Wendler

@wendlerc.bsky.social

Apply to Akhil's lab, he is great!

Akhil Arora @akhilarora.bsky.social · Mar 18

I am recruiting 2 PhD students for Fall'25 @csaudk.bsky.social to work on bleeding-edge topics in #NLProc #LLMs #AIAgents (e.g. LLM reasoning, knowledge-seeking agents, and more).

Details: www.cs.au.dk/~clan/openings
Deadline: May 1, 2025

Please boost!

cc: @aicentre.dk @wikiresearch.bsky.social

Open positions and projects

### Open semester and Master's projects If you're an AU student looking for a semester project, a Bachelor project, or an MS thesis project, please refer to [this list](projects). ### Prospective PhD ...

www.cs.au.dk

March 18, 2025 at 3:04 PM

Reposted by Chris Wendler

Aaron Mueller

@amuuueller.bsky.social

Lots of work coming soon to @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April/May! Come chat with us about new methods for interpreting and editing LLMs, multilingual concept representations, sentence processing mechanisms, and arithmetic reasoning. 🧵

March 11, 2025 at 2:30 PM

Reposted by Chris Wendler

Andrew Lee

@ajyl.bsky.social

Excited about recent reasoning models? What is happening under the hood?
Join ARBOR: Analysis of Reasoning Behaviors thru *Open Research* - a radically open collaboration to reverse-engineer reasoning models!
Learn more: arborproject.github.io
1/N

ARBOR

arborproject.github.io

February 20, 2025 at 7:55 PM

Chris Wendler

@wendlerc.bsky.social

This seems like an elegant idea!

Chris Offner @chrisoffner3d.bsky.social · Feb 17

This paper masks out principal components instead of RGB patches because
(1) visible pixels may be redundant with masked ones,
(2) visible pixels may not be predictive of masked regions.

+38% on classification tasks.

I wonder how much CroCo & *ST3R might benefit from this.
arxiv.org/abs/2502.06314

February 17, 2025 at 5:10 PM

Reposted by Chris Wendler

David Bau

@davidbau.bsky.social

DeepSeek R1 shows how important it is to be studying the internals of reasoning models. Try our code: Here @canrager.bsky.social shows a method for auditing AI bias by probing the internal monologue.

dsthoughts.baulab.info

I'd be interested in your thoughts.

dsthoughts.baulab

January 31, 2025 at 2:30 PM

Reposted by Chris Wendler

Nathan Lambert

@natolambert.bsky.social

The AI agent spectrum
Separating different classes of AI agents from a long history of reinforcement learning.
Why we can be optimistic for AI agents but also extremely critical of the terrible communications around them to date.
Plus, some policy guidance.

The AI Agent Spectrum

Separating different classes of AI agents from a long history of reinforcement learning.

buff.ly

December 18, 2024 at 3:50 PM

Chris Wendler

@wendlerc.bsky.social

The resources you find online on transformers are just next level... My jaw dropped when I first stumbled upon this video series: www.youtube.com/watch?v=V3NQ...

0L - Theory [rough early thoughts]

YouTube video by Mechanistic Interpretability

www.youtube.com

December 13, 2024 at 8:54 AM

Reposted by Chris Wendler

Alexander Kolesnikov

@kolesnikov.ch

Ok, it is yesterdays news already, but good night sleep is important.

After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @xzhai.bsky.social and @giffmana.ai, we will establish OpenAI Zurich office. Proud of our past work and looking forward to the future.

December 4, 2024 at 9:14 AM

Chris Wendler

@wendlerc.bsky.social

bit grumpy but great summary of the tokenformer paper

www.youtube.com/watch?v=gfU5...

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained)

YouTube video by Yannic Kilcher

www.youtube.com

November 25, 2024 at 9:38 PM

Reposted by Chris Wendler

Julian Minder

@jkminder.bsky.social

Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about a 1D knob! 🎛️
arxiv.org/abs/2411.07404

Co-led with
@kevdududu.bsky.social - @niklasstoehr.bsky.social , Giovanni Monea, @wendlerc.bsky.social, Robert West & Ryan Cotterell.

November 22, 2024 at 3:49 PM

Chris Wendler

@wendlerc.bsky.social

In case you also wondered how to derive the maximal update parametrisation (muP) learning rate for ADAM. I did a short write up: tinyurl.com/mup-for-adam. Thanks Ilia Badanin and Eugene Golikov for your help on this.

Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.

A new tool that blends your everyday work apps into one. It's the all-in-one workspace for you and your team

tinyurl.com

November 20, 2024 at 12:02 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news