@pauljanson002.bsky.social
7/N🧵 For language models, we continually pre-trained on DCLM → Stack → German, a sequence of increasingly strong distribution shifts. Infinite LR schedules consistently achieved lower validation loss on previously seen tasks than cosine decay.
March 17, 2025 at 8:27 PM
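To make the protocol above concrete, here is a hypothetical Python sketch of continual pre-training over a task sequence with retention evaluation. The dataset names come from the post; `continual_pretrain`, `train_on`, `eval_loss`, and `get_loader` are placeholder names, not the paper's actual code.

```python
# Hypothetical sketch: train on each dataset in turn without resetting the
# optimizer / LR-schedule state, then measure validation loss on every dataset
# seen so far (the "previous tasks" retention metric mentioned above).

task_stream = ["DCLM", "Stack", "German"]  # increasingly strong distribution shift


def continual_pretrain(model, schedule_state, get_loader, train_on, eval_loss):
    """`get_loader`, `train_on`, and `eval_loss` are placeholder callables."""
    retention = {}
    seen = []
    for task in task_stream:
        seen.append(task)
        # Continue pre-training on the new task; the LR schedule keeps running.
        train_on(model, get_loader(task, split="train"), schedule_state)
        # Evaluate on all previously seen tasks to track forgetting.
        retention[task] = {
            prev: eval_loss(model, get_loader(prev, split="val")) for prev in seen
        }
    return retention
```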
6/N🧵 For vision, we used MAE pre-training across ImageNet → Places → FireRisk (a progression from object recognition to scene understanding to aerial imagery). Infinite LR schedules showed better knowledge retention across all scenarios!
March 17, 2025 at 8:27 PM
4/N🧵 The "infinite learning rate schedule" has four phases: warmup, cooldown, constant, and annealing. The constant phase is key - it allows continual training without a predefined endpoint. For deployment, you can anneal to get final performance gains.
March 17, 2025 at 8:27 PM
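For concreteness, a minimal Python sketch of such a four-phase schedule is below. The linear warmup, the cosine-shaped cooldown, and every step count and learning-rate value are illustrative assumptions, not the paper's exact formulation.

```python
import math


def infinite_lr(step, *, warmup_steps=1_000, cooldown_steps=4_000,
                peak_lr=3e-4, constant_lr=1e-4,
                anneal_start=None, anneal_steps=2_000, final_lr=3e-5):
    """Illustrative four-phase 'infinite' schedule:
    warmup -> cooldown -> constant (open-ended) -> optional annealing."""
    if step < warmup_steps:
        # Phase 1: linear warmup from 0 to the peak LR.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        # Phase 2: cosine cooldown from the peak LR down to the constant LR.
        t = (step - warmup_steps) / cooldown_steps
        return constant_lr + 0.5 * (peak_lr - constant_lr) * (1 + math.cos(math.pi * t))
    if anneal_start is None or step < anneal_start:
        # Phase 3: constant LR -- training can continue with no predefined endpoint.
        return constant_lr
    # Phase 4: anneal linearly to a small final LR before deployment.
    t = min((step - anneal_start) / anneal_steps, 1.0)
    return constant_lr + (final_lr - constant_lr) * t
```

In this sketch the annealing phase is only triggered when `anneal_start` is set (e.g. from a checkpoint being prepared for deployment); otherwise the schedule stays in its constant phase indefinitely.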
1/N🧵 Have you been using cosine decay to continually pre-train your foundation models? 💭 Excited to share our new paper, Beyond Cosine Decay, where we explore infinite LR schedules ♾. Check it out! arxiv.org/abs/2503.02844. #MachineLearning #AI #optimization #continualAI
March 17, 2025 at 8:27 PM