www.jachterberg.com
arxiv.org/abs/2506.02813
The first of these is a routing loss that penalizes the use of more complex experts during training, and the second scales this loss by the model’s performance on the task being solved.
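A minimal sketch of how these two ingredients could fit together, not the paper's actual implementation. The names `routing_complexity_loss`, `performance_scaled_loss`, `expert_costs`, and `weight` are hypothetical. It also assumes performance is proxied by `exp(-task_loss)` and that the penalty grows as performance improves; the post does not specify either.

```python
import math


def routing_complexity_loss(routing_probs, expert_costs):
    """Mean expected expert cost under the router's probabilities.

    routing_probs: one probability vector over experts per sample.
    expert_costs: hypothetical complexity score per expert (higher = more complex).
    """
    total = 0.0
    for probs in routing_probs:
        total += sum(p * c for p, c in zip(probs, expert_costs))
    return total / len(routing_probs)


def performance_scaled_loss(task_loss, routing_probs, expert_costs, weight=0.1):
    """Task loss plus a routing penalty scaled by a performance proxy."""
    # Assumption: performance is approximated by exp(-task_loss), in (0, 1].
    # A poorly performing model is barely penalized for using complex experts;
    # a well performing one is pushed toward simpler experts.
    performance = math.exp(-task_loss)
    penalty = routing_complexity_loss(routing_probs, expert_costs)
    return task_loss + weight * performance * penalty


if __name__ == "__main__":
    probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # two samples, three experts
    costs = [1.0, 2.0, 4.0]                     # illustrative expert complexities
    print(routing_complexity_loss(probs, costs))
    print(performance_scaled_loss(0.5, probs, costs))
```

In a real training loop the same expression would be computed on differentiable tensors so the router receives gradients from the penalty.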
(1) Consistency: Models trained on the same tasks should have similar pathways
(2) Self-sufficiency: Pathways should be primarily reliant on their own experts
(3) Distinctness: Many distinct pathways should be present
We train models on 82 tasks of varying complexity (ModCog)!