Taha Yassine
@tahayassine.me
Independent researcher working on NLP/LLMs · PhD in AI & Wireless Comms

[3] is perhaps the most thorough work I could find exploring this setup for learning multiple tasks; they also investigate soft routing. [4] seems interesting too: they train LoRAs on the same base for different tasks and train a router to select the correct LoRA for a given input.
December 16, 2024 at 9:33 PM
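To make the [4]-style idea concrete (this is not their actual code, just a rough sketch with invented names, shapes, and rank): a frozen base linear layer carries one LoRA adapter per task, and a small router picks which adapter to apply for each input. When task labels are available, the router logits can additionally get a cross-entropy term so it learns to select the correct LoRA.

```python
import torch
import torch.nn as nn

class RoutedLoRALinear(nn.Module):
    """Frozen base linear layer + one LoRA adapter per task + a router that picks the adapter."""
    def __init__(self, base: nn.Linear, n_tasks: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the adapters and the router are trained
        d_in, d_out = base.in_features, base.out_features
        # one (A, B) pair per task; B starts at zero so each delta_W starts at zero
        self.A = nn.Parameter(torch.randn(n_tasks, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_tasks, d_out, rank))
        self.router = nn.Linear(d_in, n_tasks)

    def forward(self, x: torch.Tensor):
        # x: (batch, d_in)
        logits = self.router(x)               # (batch, n_tasks)
        task = logits.argmax(dim=-1)          # hard selection of one LoRA per input
        down = torch.einsum("bi,bri->br", x, self.A[task])      # (batch, rank)
        delta = torch.einsum("br,bor->bo", down, self.B[task])  # (batch, d_out)
        # router logits are returned so a routing loss can be applied when task labels are known
        return self.base(x) + delta, logits
```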

On the other hand, I think what you're looking for in your use case is a task-level MoE rather than a token-level one. For example, both [1] and [2] find that a task-MoE is better than a token-MoE for language-related tasks.
December 16, 2024 at 9:33 PM
In the case of Mixtral they don't mention any special auxiliary loss to incentivize the router to push experts to specialize. In general, an auxiliary term may be added to encourage an even assignment of tokens across experts for better load balancing.
December 16, 2024 at 9:33 PM
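For reference, one common form of that auxiliary term is the Switch-Transformer-style load-balancing loss: the fraction of tokens routed to each expert times the mean router probability for that expert, summed over experts, which is minimized when both are spread evenly. The sketch below is a generic version, not Mixtral's code; the tensor shapes and top_k value are assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, top_k: int = 2) -> torch.Tensor:
    """Generic Switch-style auxiliary loss; router_logits has shape (num_tokens, num_experts)."""
    probs = F.softmax(router_logits, dim=-1)                    # (tokens, experts)
    # f_i: fraction of tokens routed to expert i under top-k assignment
    topk_idx = probs.topk(top_k, dim=-1).indices                # (tokens, top_k)
    mask = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # (tokens, experts), 0/1 entries
    tokens_per_expert = mask.mean(dim=0)                        # (experts,)
    # P_i: mean router probability assigned to expert i
    prob_per_expert = probs.mean(dim=0)                         # (experts,)
    # minimized when tokens and probability mass are spread evenly across experts
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

In practice this gets scaled by a small coefficient and added to the main language-modeling loss.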
Sorry I'm only responding now.
I'm no expert when it comes to MoEs (no pun intended), but I believe what you're referring to is the specialization of experts under no explicit domain conditioning.
December 16, 2024 at 9:33 PM
Maybe you could train an MoE? Your aux model would be the router and part of the main model, and you'd train it with a corresponding loss term to route to the correct expert at training time. This obviously means you'd have as many experts as you have modes in your data dist if you do hard routing.
December 14, 2024 at 9:54 AM
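A minimal sketch of that suggestion, assuming the mode label is known at training time (all names and sizes are invented): the router gets its own cross-entropy term against the mode label, routing is teacher-forced during training, and at inference each input is hard-routed through the argmax expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardRoutedMoE(nn.Module):
    """One expert per mode of the data distribution; the router plays the role of the aux model."""
    def __init__(self, d_model: int, n_modes: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_modes)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modes)
        )

    def forward(self, x: torch.Tensor, mode_labels=None):
        # x: (batch, d_model); mode_labels: (batch,) long tensor, available at training time
        logits = self.router(x)
        if mode_labels is not None:
            routing_loss = F.cross_entropy(logits, mode_labels)  # push the router to the correct expert
            chosen = mode_labels                                  # teacher-forced hard routing
        else:
            routing_loss = None
            chosen = logits.argmax(dim=-1)                        # hard routing at inference
        # per-sample dispatch, kept naive for readability
        out = torch.stack([self.experts[i](xi) for i, xi in zip(chosen.tolist(), x)])
        return out, routing_loss
```

The routing loss would then be added to the main objective with some weight, e.g. loss = task_loss + lambda * routing_loss.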
These madlads also made a tool that allows you to create a colormap and shows you advanced metrics to help you
December 3, 2024 at 10:07 PM
"network graph" seems to work as a workaround
December 1, 2024 at 1:28 PM
Wow, TIL. Now it's gonna sound weird when I use it in French.
December 1, 2024 at 1:17 PM
Lofi for reading papers and synthwave for coding
November 30, 2024 at 4:38 AM
Nice to know, will give it a try
November 29, 2024 at 3:36 PM
Have you considered using an eGPU?
November 29, 2024 at 3:21 PM
Any reason you went this route rather than using something like Ansible?
November 27, 2024 at 8:13 AM
- vscode works really well with the remote extension, so no need to use the browser client imo
- nix shells are great if you use them outside of Python, cf my 1st point. I use them with direnv and really like the dx
November 25, 2024 at 8:36 PM
This is almost my current setup but here are a few points:
- I use an eGPU but a dedicated server is cool too
- you really want to use Docker on top of NixOS; Python on NixOS is a disaster because packages are not always available/up to date/working. Inside containers, use traditional pip/uv
November 25, 2024 at 8:36 PM
I'm sure there was a spike of downloads at some point around January of this year, and they were all me
November 25, 2024 at 4:08 AM