Lightnews — Scholar-powered news

Lee Sharkey

@leesharkey.bsky.social

Work done by Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, and me (Lee Sharkey)!

Check out our

- paper: publications.apolloresearch.ai/apd

- blog post: www.alignmentforum.org/posts/EPefYW...

Attribution-based parameter decomposition — AI Alignment Forum

This is a linkpost for Apollo Research's new interpretability paper: …

www.alignmentforum.org

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

Overall, our approach helps address issues that seemed challenging using SAEs.

- It decomposes network parameters directly
- It suggests a conceptual foundation for the concept of a 'feature'
- It suggests an approach to better understanding feature geometry

and more.

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

The method is currently sensitive to hyperparameters, so still needs some work before it can be scaled.

But we have a few ideas for how to achieve this and plan to address that issue next!

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

And the method lets us identify computations that are spread across multiple layers.

This has been conceptually challenging for the SAE paradigm to overcome. (Crosscoder features aren't the computations themselves, but are more akin to the results of the computations).

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

Our method lets us identify fundamental computations (or 'circuits') in a toy model of 'Compressed computation', which is a phenomenon similar to 'Computation in superposition'.

Each parameter component learns to implement a different basic computation.

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

The key idea: Neural networks only need certain parts of their parameters on each forward pass. The rest can be thrown away (on that forward pass).

How to identify which parts are needed?

Using attribution methods.

Hence the name Attribution-based Parameter Decomposition!

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

For example, with anthropic's Toy Model of Superposition, we can decompose the parameters directly into mechanisms that are used by individual features.

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

Here, 'simple' means they span as few ranks and as few layers as possible.

We think parameter components that satisfy these properties can reasonably be called the network's mechanisms.

This lets us identify mechanisms in toy models where there is known ground truth!

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

Parameter components are trained for three things:
- They sum to the original network's parameters
- As few as possible are needed to replicate the network's behavior on any given datapoint in the training data
- They are individually 'simpler' than the whole network.

January 27, 2025 at 7:29 PM

Lee Sharkey

@leesharkey.bsky.social

A neural network's weights can be thought of as a big vector in 'parameter space'.

During training, gradient descent etches a network's 'mechanisms' into its parameter vector.

We look for those mechanisms by decomposing that parameter vector into 'parameter components'.

January 27, 2025 at 7:29 PM

Reposted by Lee Sharkey

Laura

@lauraruis.bsky.social

To my surprise, we find the opposite of what I thought when we started this project:

The approach to reasoning LLMs use looks unlike retrieval, and more like a generalisable strategy synthesising procedural knowledge from many documents doing a similar form of reasoning.

November 20, 2024 at 4:35 PM

Lee Sharkey

@leesharkey.bsky.social

I'd like to be added, thanks!

November 19, 2024 at 11:06 AM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news