@pauljanson002.bsky.social
N/N🧵 Read our full paper: arxiv.org/abs/2503.02844. This is joint work with
Vaibhav Singh, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, and Benjamin Thérien.
[arxiv.org link card] Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
March 17, 2025 at 8:27 PM
7/N🧵 For language models, we continually pre-trained on DCLM→Stack→German datasets, representing increasingly strong distribution shifts. We found that infinite LR schedules consistently showed lower validation loss on previous tasks.
6/N🧵 For vision, we used MAE pre-training across ImageNet → Places → FireRisk (a progression from object recognition to scene understanding to aerial imagery). Our results show that infinite LR schedules retain knowledge better across all scenarios!
5/N🧵 We tested our approach across both vision and language domains with non-IID data distributions, i.e., severe distribution shifts that typically cause catastrophic forgetting.
4/N🧵 The "infinite learning rate schedule" has four phases: warmup, cooldown, constant, and annealing. The constant phase is key: it allows continual training without a predefined endpoint. For deployment, you can anneal the LR to recover final performance gains.
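A minimal sketch of such a four-phase schedule, assuming linear warmup, a cosine cooldown to the constant value, and an exponential annealing phase (these shapes and all constants are illustrative assumptions, not the paper's exact settings):

```python
import math

def infinite_lr(step, *, peak_lr=3e-4, const_lr=1e-4, min_lr=1e-6,
                warmup_steps=1_000, cooldown_steps=4_000,
                anneal_start=None, anneal_steps=2_000):
    """Illustrative 4-phase 'infinite' schedule:
    warmup -> cooldown -> constant (open-ended) -> optional annealing.
    Shapes and constants are assumptions for illustration only."""
    if step < warmup_steps:                       # 1) linear warmup to peak_lr
        return peak_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:      # 2) cosine cooldown from peak_lr to const_lr
        t = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (peak_lr - const_lr) * (1 + math.cos(math.pi * t))
    if anneal_start is None or step < anneal_start:
        return const_lr                           # 3) constant phase: no predefined endpoint
    t = min((step - anneal_start) / anneal_steps, 1.0)
    return const_lr * (min_lr / const_lr) ** t    # 4) annealing: decay toward min_lr before deployment
```

Because phase 3 returns a fixed value, training can continue on new data for as long as needed, and the annealing phase is only applied when you want a deployable checkpoint.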
3/N🧵 The problem: most continual pre-training approaches use repeated cosine annealing with a fixed duration. This limits future training and causes forgetting during the re-warming phases. We propose using infinite learning rate schedules instead.
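For comparison, a sketch of the repeated-cosine setup described here, where every new task re-warms the LR to its peak and decays it over a fixed, predefined duration (again, shapes and constants are illustrative assumptions):

```python
import math

def repeated_cosine_lr(step, *, task_len=10_000, warmup_steps=1_000,
                       peak_lr=3e-4, min_lr=3e-5):
    """Illustrative repeated cosine schedule for continual pre-training:
    each task restarts warmup to peak_lr, then cosine-decays to min_lr
    over a fixed task_len. The re-warming spike at every task boundary
    is the part associated with forgetting in the thread above."""
    s = step % task_len                            # position within the current task
    if s < warmup_steps:                           # linear re-warmup at each task boundary
        return peak_lr * s / warmup_steps
    t = (s - warmup_steps) / (task_len - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```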
2/N🧵 But how? Infinite learning rate schedules provide better control over catastrophic forgetting, without needing predefined training durations! Could this be the key to real-world AI systems that continuously adapt to evolving data while preserving past knowledge?