@pauljanson002.bsky.social
N/N🧵 Read our full paper: arxiv.org/abs/2503.02844. This is joint work with
Vaibhav Singh, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, and Benjamin Thérien.
[arxiv.org link card] Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
March 17, 2025 at 8:27 PM
7/N🧵 For language models, we continually pre-trained on DCLM→Stack→German datasets, representing increasingly strong distribution shifts. We found that infinite LR schedules consistently showed lower validation loss on previous tasks.
6/N🧵 For vision, we used MAE pre-training across ImageNet → Places → FireRisk (a progression from object recognition to scene understanding to aerial imagery). Our results show that infinite LR schedules retain knowledge better across all scenarios!
5/N🧵 We tested our approach across both vision and language domains with non-IID data distributions, i.e., severe distribution shifts that typically cause catastrophic forgetting.
4/N🧵 The "infinite learning rate schedule" has four phases: warmup, cooldown, constant, and annealing. The constant phase is key: it allows continual training without a predefined endpoint. For deployment, you can anneal the LR to recover final performance gains.
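A minimal sketch of such a four-phase schedule, assuming linear warmup, a cosine cooldown to the constant value, and an exponential annealing phase (these shapes and all constants are illustrative assumptions, not the paper's exact settings):

```python
import math

def infinite_lr(step, *, peak_lr=3e-4, const_lr=1e-4, min_lr=1e-6,
                warmup_steps=1_000, cooldown_steps=4_000,
                anneal_start=None, anneal_steps=2_000):
    """Illustrative 4-phase 'infinite' schedule:
    warmup -> cooldown -> constant (open-ended) -> optional annealing.
    Shapes and constants are assumptions for illustration only."""
    if step < warmup_steps:                       # 1) linear warmup to peak_lr
        return peak_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:      # 2) cosine cooldown from peak_lr to const_lr
        t = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (peak_lr - const_lr) * (1 + math.cos(math.pi * t))
    if anneal_start is None or step < anneal_start:
        return const_lr                           # 3) constant phase: no predefined endpoint
    t = min((step - anneal_start) / anneal_steps, 1.0)
    return const_lr * (min_lr / const_lr) ** t    # 4) annealing: decay toward min_lr before deployment
```

Because phase 3 returns a fixed value, training can continue on new data for as long as needed, and the annealing phase is only applied when you want a deployable checkpoint.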
3/N🧵 The problem: most continual pre-training approaches use repeated cosine annealing with a fixed duration. This limits future training and causes forgetting during the re-warming phases. We propose using infinite learning rate schedules instead.
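For comparison, a sketch of the repeated-cosine setup described here, where every new task re-warms the LR to its peak and decays it over a fixed, predefined duration (again, shapes and constants are illustrative assumptions):

```python
import math

def repeated_cosine_lr(step, *, task_len=10_000, warmup_steps=1_000,
                       peak_lr=3e-4, min_lr=3e-5):
    """Illustrative repeated cosine schedule for continual pre-training:
    each task restarts warmup to peak_lr, then cosine-decays to min_lr
    over a fixed task_len. The re-warming spike at every task boundary
    is the part associated with forgetting in the thread above."""
    s = step % task_len                            # position within the current task
    if s < warmup_steps:                           # linear re-warmup at each task boundary
        return peak_lr * s / warmup_steps
    t = (s - warmup_steps) / (task_len - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```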
2/N🧵 But how? Infinite learning rate schedules provide better control over catastrophic forgetting, without needing predefined training durations! Could this be the key to real-world AI systems that continuously adapt to evolving data while preserving past knowledge?