Fabian Schaipp
@fschaipp.bsky.social
Researcher in Optimization for ML at Inria Paris. Previously at TU Munich.

https://fabian-sp.github.io/
Bonus: this provides a provable explanation for the benefit of cooldown: if we plug the wsd schedule into the bound, a log term (H_{T+1}) vanishes compared to the constant LR (dark grey).
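For context, a hedged gloss (my reading of the notation, not a quote from the paper): if H_{T+1} denotes the harmonic number, the vanished term really is logarithmic in the training horizon.

```latex
% Assuming H_{T+1} denotes the harmonic number (my reading of the post's notation):
H_{T+1} \;=\; \sum_{t=1}^{T+1} \frac{1}{t} \;=\; \ln(T+1) + \mathcal{O}(1).
```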
February 5, 2025 at 10:13 AM
How does this help in practice? In continued training, we need to decrease the learning rate in the second phase. But by how much?

Using the theoretically optimal schedule (which can be computed for free), we obtain a noticeable improvement when training 124M and 210M models.
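A minimal sketch of what a two-phase schedule for continued training looks like. This is not the paper's recipe: the restart factor of 0.5 and the linear decay are placeholder choices, which the thread argues can instead be derived from the convex bound.

```python
import numpy as np

def continued_training_schedule(T1, T2, peak_lr, phase2_peak_factor=0.5):
    """Sketch of a two-phase LR schedule for continued training.

    Phase 1: constant LR at peak_lr for T1 steps (the original run).
    Phase 2: restart at phase2_peak_factor * peak_lr and decay linearly
    to zero over T2 additional steps.

    NOTE: phase2_peak_factor=0.5 and the linear shape are placeholders;
    the point of the thread is that the phase-2 schedule can be computed
    from the convex bound instead of being guessed.
    """
    phase1 = np.full(T1, peak_lr)
    phase2 = phase2_peak_factor * peak_lr * np.linspace(1.0, 0.0, T2)
    return np.concatenate([phase1, phase2])

lrs = continued_training_schedule(T1=1000, T2=200, peak_lr=3e-4)
print(len(lrs), lrs[0], lrs[1000], lrs[-1])
```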
February 5, 2025 at 10:13 AM
This allows us to understand LR schedules beyond experiments: we study (i) the optimal cooldown length, and (ii) the impact of gradient norms on schedule performance.
The second part suggests that the sudden drop in loss during cooldown happens when gradient norms do not go to zero.
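To make point (i) concrete, a small sketch of the kind of cooldown-length sweep involved, with a generic wsd schedule (linear cooldown; the step counts, peak LR, and fractions below are illustrative, not the paper's settings, and per the thread the comparison is done on the convex-theory bound rather than by re-running full trainings).

```python
import numpy as np

def wsd_schedule(T, peak_lr, cooldown_frac):
    """Constant LR, then a linear cooldown over the last `cooldown_frac` of the run."""
    T_cool = int(cooldown_frac * T)
    return np.concatenate([
        np.full(T - T_cool, peak_lr),
        peak_lr * np.linspace(1.0, 0.0, T_cool),
    ])

# Sweep a few illustrative cooldown fractions.
for frac in (0.1, 0.2, 0.4):
    lrs = wsd_schedule(T=10_000, peak_lr=3e-4, cooldown_frac=frac)
    print(f"cooldown_frac={frac}: len={len(lrs)}, last LR={lrs[-1]:.2e}")
```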
February 5, 2025 at 10:13 AM
Using a bound from arxiv.org/pdf/2310.07831, we can reproduce the empirical behaviour of the cosine and wsd (= constant + cooldown) schedules. Surprisingly, the result is for convex problems, yet it still matches the actual loss of (nonconvex) LLM training.
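For readers who haven't seen the two schedules side by side, a minimal sketch of their shapes (warmup omitted; the linear cooldown, the 20% cooldown fraction, and the peak LR are illustrative choices, not the paper's settings).

```python
import numpy as np

def cosine_schedule(T, peak_lr):
    """Cosine decay from peak_lr down to (nearly) zero over T steps."""
    t = np.arange(T)
    return 0.5 * peak_lr * (1.0 + np.cos(np.pi * t / T))

def wsd_schedule(T, peak_lr, cooldown_frac=0.2):
    """'Stable' (constant) phase at peak_lr, then a linear cooldown to zero."""
    T_cool = int(cooldown_frac * T)
    return np.concatenate([
        np.full(T - T_cool, peak_lr),
        peak_lr * np.linspace(1.0, 0.0, T_cool),
    ])

# The two schedule shapes the thread compares against the convex bound.
T, peak_lr = 10_000, 3e-4
cos_lrs = cosine_schedule(T, peak_lr)
wsd_lrs = wsd_schedule(T, peak_lr)
```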
February 5, 2025 at 10:13 AM
nice!
Figure 9 looks like a lighthouse guiding the way (towards the data distribution)
December 13, 2024 at 4:27 PM