Check out our
- paper: publications.apolloresearch.ai/apd
- blog post: www.alignmentforum.org/posts/EPefYW...
- It decomposes network parameters directly
- It suggests a conceptual foundation for the notion of a 'feature'
- It suggests an approach to better understanding feature geometry
and more.
But we have a few ideas for how to achieve this and plan to address that issue next!
This has been conceptually challenging for the SAE paradigm to overcome. (Crosscoder features aren't the computations themselves, but are more akin to the results of the computations).
Each parameter component learns to implement a different basic computation.
How do we identify which parts are needed?
Using attribution methods.
Hence the name Attribution-based Parameter Decomposition!
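To make that concrete, here is a minimal sketch of how a gradient-based attribution could score parameter components on a single datapoint. Everything below (the function name, the tensor shapes, the restriction to a single linear layer) is illustrative rather than the paper's exact procedure.

```python
import torch

def attribution_scores(W_components, x, target_fn):
    """Score how much each parameter component contributes on input x.

    W_components: (C, d_out, d_in) stack of components summing to the layer's weights.
    x:            (d_in,) a single datapoint.
    target_fn:    maps the layer output to a scalar we attribute (e.g. one logit).
    """
    W = W_components.sum(dim=0)                         # faithful sum of all components
    out = (W @ x).detach().requires_grad_(True)         # the layer's actual output
    grad = torch.autograd.grad(target_fn(out), out)[0]  # d target / d output
    per_component_out = torch.einsum("cij,j->ci", W_components, x)
    return per_component_out @ grad                     # (C,) attribution per component
```

On each datapoint, only the components with the highest attributions would then be kept active; the rest are ablated.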
We think parameter components that satisfy these properties can reasonably be called the network's mechanisms.
This lets us identify mechanisms in toy models where there is known ground truth!
- They sum to the original network's parameters
- As few as possible are needed to replicate the network's behavior on any given datapoint in the training data
- They are individually 'simpler' than the whole network.
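These three properties map naturally onto training losses. Below is a minimal sketch of what such loss terms could look like; the tensor layout, the exact penalties, and all names are assumptions for illustration, not the paper's precise objective.

```python
import torch

def apd_losses(components, theta_orig, sparse_out, full_out):
    """Illustrative loss terms mirroring the three properties above.

    components : (C, n_params) stack of parameter components.
    theta_orig : (n_params,) the original network's parameters.
    sparse_out : model output using only the few components flagged as needed.
    full_out   : the original network's output on the same datapoint.
    """
    # 1) Faithfulness: the components should sum to the original parameters.
    faithfulness = ((components.sum(dim=0) - theta_orig) ** 2).mean()
    # 2) Minimality: the small active subset alone should replicate behaviour.
    minimality = ((sparse_out - full_out) ** 2).mean()
    # 3) Simplicity: each component should be 'simpler' than the whole network;
    #    an Lp penalty on each component is used here as a stand-in.
    simplicity = components.abs().pow(0.9).sum(dim=-1).mean()
    return faithfulness + minimality + simplicity
```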
During training, gradient descent etches a network's 'mechanisms' into its parameter vector.
We look for those mechanisms by decomposing that parameter vector into 'parameter components'.
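As a rough picture of what that decomposition means in code: the flattened parameter vector is represented as a sum of learned components of the same shape. A minimal sketch, with hypothetical names and shapes:

```python
import torch
import torch.nn as nn

class ParameterComponents(nn.Module):
    """Represent a network's flattened parameter vector as a sum of C components."""

    def __init__(self, n_params: int, n_components: int):
        super().__init__()
        # Each row is a candidate 'mechanism' living in parameter space.
        self.components = nn.Parameter(0.01 * torch.randn(n_components, n_params))

    def forward(self, active=None):
        """Sum the components (optionally only an active subset) into a parameter vector."""
        comps = self.components if active is None else self.components[active]
        return comps.sum(dim=0)
```

Training then pushes this sum to reproduce the original parameters, while each component stays simple and only a few are needed per datapoint.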
The approach to reasoning that LLMs use looks unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from many documents doing a similar form of reasoning.