Stephanie Chan
@scychan.bsky.social
Staff Research Scientist at Google DeepMind. Artificial and biological brains 🤖 🧠
Some general takeaways for interp:
March 11, 2025 at 6:18 PM
4. We provide intuition for these dynamics through a simple mathematical model.
March 11, 2025 at 6:18 PM
3. A lot of previous work (including our own) has emphasized *competition* between in-context and in-weights learning.
But we find that cIWL and ICL actually compete AND cooperate, via shared subcircuits. In fact, ICL cannot emerge if cIWL is blocked from emerging, even though ICL emerges first!
March 11, 2025 at 6:18 PM
2. At the end of training, ICL doesn't give way to in-weights learning (IWL), as we previously thought. Instead, the model prefers a surprising strategy that is a *combination* of the two!
We call this combo "cIWL" (context-constrained in-weights learning).
March 11, 2025 at 6:18 PM