🌍 Is scaling diff by lang?
🧙‍♂️ Can we model the curse of multilinguality?
⚖️ Pretrain vs finetune from checkpoint?
🔀 X-lingual transfer scores across langs?
1/🧵
Come by or reach out if you want to chat about pretraining, scaling laws, or conditional computation!
arxiv.org/abs/2310.07707