tahayassine.me
On the other hand, I think what you want for your use case is to train a task-level MoE rather than a token-level one. For example, both [1] and [2] find that a task-MoE is better than a token-MoE for language-related tasks.
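To make the distinction concrete, here's a rough sketch in plain Python. It isn't taken from [1] or [2]: the function names, the gate scores, and the choice of top-1 routing are all assumptions for illustration. The only point is where the routing decision happens: once per token versus once per task.

```python
def argmax(scores):
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def token_level_route(token_scores):
    """Token-level routing: every token picks its own top-1 expert.

    token_scores holds one list of per-expert gate scores per token.
    """
    return [argmax(scores) for scores in token_scores]


def task_level_route(task_scores, num_tokens):
    """Task-level routing: pick one top-1 expert from the task's gate
    scores and send every token of the input to that same expert."""
    expert = argmax(task_scores)
    return [expert] * num_tokens


# Hypothetical gate scores: a 3-token input and 4 experts.
token_scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
# Hypothetical scores produced once for the whole task (e.g. from a task ID).
task_scores = [0.2, 0.1, 0.6, 0.1]

token_assignment = token_level_route(token_scores)
task_assignment = task_level_route(task_scores, len(token_scores))

print("token-level:", token_assignment)
print("task-level: ", task_assignment)
```

With token-level routing the three tokens can land on three different experts, while with task-level routing the expert is fixed by the task, so every token of that input goes through the same one.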