- support the contravariance principle (Cao & @dyamins.bsky.social)
- reveal when weight- and dynamics-level variability move together (or in opposite directions)
- give "knobs" for controlling degeneracy, whether you're studying shared mechanisms or individual variability in task-trained RNNs.
Both types of structural regularization reduce degeneracy across all levels. Regularization nudges networks toward more consistent, shared solutions.
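For illustration, here is a minimal sketch of two common structural regularizers for a vanilla RNN: a rank-constrained recurrent matrix and an L2 penalty on the recurrent weights. These specific choices, and all names and hyperparameters, are assumptions for the example, not necessarily the paper's setup.

```python
import torch
import torch.nn as nn

# Sketch only: two illustrative structural regularizers for a vanilla RNN.
# The specific choices (rank-constrained recurrent matrix, L2 penalty on it)
# are assumptions for this example, not necessarily the paper's regularizers.

class LowRankRNNCell(nn.Module):
    """Vanilla RNN cell whose recurrent matrix is constrained to W = U @ V.T."""
    def __init__(self, input_size, hidden_size, rank):
        super().__init__()
        self.w_in = nn.Linear(input_size, hidden_size)
        self.U = nn.Parameter(torch.randn(hidden_size, rank) / hidden_size**0.5)
        self.V = nn.Parameter(torch.randn(hidden_size, rank) / hidden_size**0.5)

    def forward(self, x, h):
        w_rec = self.U @ self.V.T              # rank <= `rank` by construction
        return torch.tanh(self.w_in(x) + h @ w_rec.T)


def l2_recurrent_penalty(cell, strength=1e-4):
    """L2 (weight-decay-style) penalty on the recurrent weights,
    added to the task loss during training."""
    w_rec = cell.U @ cell.V.T
    return strength * w_rec.pow(2).sum()
```

During training the penalty is simply added to the task loss; lowering `rank` or raising `strength` is the kind of "knob" the thread refers to, nudging different seeds toward more similar solutions.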
When we fix feature learning (using µP), larger RNNs converge to more consistent solutions at all levels — weights, dynamics, and behavior.
A clean convergence-with-scale effect, demonstrated on RNNs across levels.
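As a rough sketch of what "fixing feature learning" with µP can look like in practice (the exact multipliers and the paper's recipe may differ, so treat everything below as an assumption): scale initialization and per-group learning rates with hidden width so that per-step feature updates stay roughly O(1) as the network grows.

```python
import torch
import torch.nn as nn

# Rough µP-flavoured width scaling for an RNN (illustrative sketch only).
# Idea: scale init and (Adam) learning rates with hidden width n so that
# feature updates stay roughly O(1) as n grows.

def make_rnn_and_optimizer(input_size, n, output_size, base_lr=1e-3):
    rnn = nn.RNN(input_size, n, batch_first=True)
    readout = nn.Linear(n, output_size)

    # Initialization: std ~ 1/sqrt(fan_in) for the RNN weights,
    # and a smaller ~1/n readout (a common µP-style choice).
    for name, p in rnn.named_parameters():
        if name.startswith("weight_hh"):
            nn.init.normal_(p, std=n ** -0.5)             # fan_in = n
        elif name.startswith("weight_ih"):
            nn.init.normal_(p, std=input_size ** -0.5)    # fan_in = input_size
    nn.init.normal_(readout.weight, std=1.0 / n)
    nn.init.zeros_(readout.bias)

    # Adam learning rates: n-by-n recurrent matrices and the readout
    # scaled like 1/n; everything else kept at the base rate.
    hh, rest = [], []
    for name, p in rnn.named_parameters():
        (hh if name.startswith("weight_hh") else rest).append(p)
    opt = torch.optim.Adam([
        {"params": rest, "lr": base_lr},
        {"params": hh, "lr": base_lr / n},
        {"params": readout.parameters(), "lr": base_lr / n},
    ])
    return rnn, readout, opt
```

With this kind of parametrization, networks of every width stay in the feature-learning regime, which is the setting where the thread reports convergence with scale.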
It also increases behavioral degeneracy under OOD inputs (likely due to overfitting).
As tasks get harder, we observe less degeneracy in dynamics/behavior, but more degeneracy in the weights.
When trained on harder tasks, RNNs converge to similar neural dynamics and OOD behavior, but their weight configurations diverge. Why?
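For concreteness, here is a minimal sketch of how degeneracy could be quantified at each level for a pair of trained networks. The specific metrics below (unaligned weight distance, Procrustes-aligned hidden-state distance, output disagreement on probe inputs) are illustrative assumptions, not necessarily the paper's definitions.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Illustrative (assumed, not the paper's exact) degeneracy metrics for a pair
# of trained RNNs. Larger values = more dissimilar solutions; averaging over
# many pairs of seeds gives a population-level degeneracy score per level.

def weight_degeneracy(W1, W2):
    """Normalized Frobenius distance between recurrent weight matrices.
    (Deliberately ignores unit relabeling across seeds -- a simplification.)"""
    return np.linalg.norm(W1 - W2) / np.linalg.norm(W2)

def dynamics_degeneracy(H1, H2):
    """Procrustes residual between hidden-state trajectories gathered on the
    same inputs. H1, H2: (time * trials, hidden) arrays, equal hidden size."""
    H1c, H2c = H1 - H1.mean(0), H2 - H2.mean(0)
    R, _ = orthogonal_procrustes(H1c, H2c)
    return np.linalg.norm(H1c @ R - H2c) / np.linalg.norm(H2c)

def behavior_degeneracy(Y1, Y2):
    """Mean squared disagreement between the two networks' outputs on the same
    probe inputs (in-distribution or OOD)."""
    return float(np.mean((Y1 - Y2) ** 2))
```

Computed over many seed pairs, these three numbers can dissociate exactly as described above: weight-level distance can grow with task difficulty while dynamics- and behavior-level distances shrink.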
We varied:
- task complexity
- learning regime
- network size
- regularization
Our findings:
arxiv.org/pdf/2410.03972
It started from a question I kept running into:
When do RNNs trained on the same task converge/diverge in their solutions?
🧵⬇️