Lightnews — Scholar-powered news

Alessandro Stolfo

@alestolfo.bsky.social

350 followers 65 following 2 posts

PhD @ ETHZ - LLM Interpretability
alestolfo.github.io


Reposted by Alessandro Stolfo

Yucheng Sun

@yuchengsun.bsky.social

1/6: Can we use an LLM’s hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes!
I’m presenting new work w/ @alestolfo.bsky.social “Probing for Arithmetic Errors in LMs” @ #ICML2025 Act Interp WS
🧵 below

July 18, 2025 at 5:22 PM

Reposted by Alessandro Stolfo

Aaron Mueller

@amuuueller.bsky.social

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!

April 23, 2025 at 6:15 PM

Alessandro Stolfo

@alestolfo.bsky.social

Our paper "Improving Instruction-Following in Language Models through Activation Steering" has been accepted to #ICLR2025!

We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877

April 15, 2025 at 4:35 PM
