arxiv.org/pdf/2410.03972
It started from a question I kept running into:
When do RNNs trained on the same task converge/diverge in their solutions?
🧵⬇️
🔹Paper: arxiv.org/pdf/2410.03972
🔹Poster: Fri Dec 5, Poster #2001 at Exhibition Hall C, D, E
Happy to chat at NeurIPS or by email at [email protected]!
Overall, these results:
- support the contravariance principle (Cao & @dyamins.bsky.social)
- reveal when weight- and dynamics-level variability move together (or in opposite directions)
- give "knobs" for controlling degeneracy, whether you're studying shared mechanisms or individual variability in task-trained RNNs.
Both types of structural regularization reduce degeneracy across all levels. Regularization nudges networks toward more consistent, shared solutions.
When we fix feature learning (using µP), larger RNNs converge to more consistent solutions at all levels — weights, dynamics, and behavior.
A clean convergence-with-scale effect, demonstrated on RNNs across levels.
Feature learning also increases behavioral degeneracy under OOD inputs (likely due to overfitting).
Complex tasks push RNNs into the feature-learning regime, where the network must adapt its internal weights and features to solve the task. Weights travel much farther from initialization, producing more dispersed solutions in weight space (higher weight degeneracy).
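The "distance from initialization" signature of feature learning is easy to check directly. A minimal sketch (the function name and the relative-norm convention are mine, not from the paper):

```python
import numpy as np

def relative_weight_travel(W_init, W_trained):
    """How far the recurrent weights moved from initialization,
    relative to their initial scale. Near 0 in the lazy regime;
    large when the network has genuinely relearned its features."""
    return np.linalg.norm(W_trained - W_init) / np.linalg.norm(W_init)

# e.g. compare the same architecture trained on an easy vs. a hard task:
# relative_weight_travel(W0, W_easy) vs. relative_weight_travel(W0, W_hard)
```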
As tasks get harder, we observe less degeneracy in dynamics/behavior, but more degeneracy in the weights.
When trained on harder tasks, RNNs converge to similar neural dynamics and OOD behavior, but their weight configurations diverge. Why?
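One crude way to compare dynamics across networks in a basis-independent way (a toy stand-in I'm sketching here, not the paper's Dynamical Similarity Analysis): fit a linear model to each network's hidden-state trajectory and compare the eigenvalue spectra of the fitted dynamics, which don't change under a relabeling or rotation of the hidden units.

```python
import numpy as np

def fit_linear_dynamics(X):
    """Least-squares A with x_{t+1} ≈ A @ x_t, from a (T, n) trajectory X."""
    M, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)  # X[:-1] @ M ≈ X[1:]
    return M.T

def spectrum_distance(Xa, Xb):
    """Compare fitted dynamics via their eigenvalues (sorted — a crude matching)."""
    ea = np.sort_complex(np.linalg.eigvals(fit_linear_dynamics(Xa)))
    eb = np.sort_complex(np.linalg.eigvals(fit_linear_dynamics(Xb)))
    return float(np.linalg.norm(ea - eb))
```

Two networks whose hidden states are rotations of each other come out at distance ~0 here, even though their raw activations and weights look very different.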
We systematically varied four knobs:
- task complexity
- learning regime
- network size
- regularization
and measured degeneracy at three levels:
🎯 Behavior: variability in OOD performance
🧠 Dynamics: distance between neural trajectories, quantified by Dynamical Similarity Analysis
⚙️ Weights: permutation-invariant Frobenius distance between recurrent weight matrices
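For small networks, the permutation-invariant weight distance can be computed exactly by brute force over hidden-unit relabelings. A sketch (real experiments would use an assignment solver rather than enumerating all n! permutations):

```python
import itertools
import numpy as np

def perm_invariant_dist(W1, W2):
    """min_P ||W1 - P @ W2 @ P.T||_F over permutation matrices P,
    i.e. the Frobenius distance after optimally relabeling W2's units.
    Brute force: only feasible for small hidden sizes."""
    n = W1.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        idx = np.array(perm)
        # P @ W2 @ P.T is just W2 with rows and columns reindexed
        best = min(best, np.linalg.norm(W1 - W2[np.ix_(idx, idx)]))
    return best
```

Two networks that implement the same circuit with shuffled unit labels come out at distance ~0, where the plain Frobenius distance would call them far apart.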