Prasanna Mayilvahanan
@prasannamayil.bsky.social
PhD student in ML at MPI-IS. Prev Apple.

Interested in robustness at scale and reasoning.
Work co-led with @thwiedemer.bsky.social, in collaboration with Sayak Mallick, Matthias Bethge and @wielandbrendel.bsky.social.

Website: brendel-group.github.io/llm-line/
Preprint: arxiv.org/abs/2502.12120
Code: github.com/brendel-grou...
8/8
February 18, 2025 at 2:09 PM
Further, our results suggest that, for a given pretraining dataset, breaking past current loss-to-loss trends would require radically new architectures or loss functions. Existing models all behave strikingly alike. 7/8
🔍 Our work refines the understanding of scaling laws beyond compute-based formulations, showing that loss-to-loss trends are shaped by the training data, not by model structure. The implication? Better dataset curation can unlock better generalization. 6/8
📉 In contrast, architecture, model size, context length, and optimizer settings have a negligible impact on these trends. This suggests that architectures can be freely optimized for efficiency, while data curation is the real key to strong generalization. 5/8
📊 Key finding: The choice of pretraining data and tokenizer has the largest impact on scaling trends. Even switching from Llama (Transformer) to Mamba (State-Space Model) barely changes loss-to-loss relationships! 4/8
We systematically vary pretraining data, tokenizer, architecture (Llama vs. Mamba), model size, context length, and optimizer settings—evaluating over 6000 model checkpoints—to uncover the true drivers of loss-to-loss scaling laws. 3/8
Compute-to-train-loss scaling laws guide LLM pretraining, but how do training and validation losses map to downstream task loss? What factors shape these laws? We analyze loss-to-loss scaling laws, extending prior work from a single architectural setting to a broad range of configurations. 2/8
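For context: a loss-to-loss scaling law maps the loss a model reaches on one distribution (e.g., its pretraining data) to its loss on another (a validation set or a downstream task), fit across many checkpoints. The sketch below is purely illustrative and not the authors' fitting code; the shifted-power-law form, variable names, and data values are assumptions made for the example.

```python
# Illustrative sketch (assumed form, not the paper's code): fit a shifted
# power law relating train loss to downstream loss across checkpoints.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(train_loss, K, kappa, E):
    # Downstream loss modeled as K * train_loss**kappa + E,
    # where E acts as an irreducible-loss offset.
    return K * train_loss ** kappa + E

# Placeholder (train loss, downstream loss) pairs from hypothetical checkpoints.
train_loss = np.array([3.2, 2.9, 2.7, 2.5, 2.3, 2.2])
downstream_loss = np.array([1.90, 1.70, 1.55, 1.45, 1.35, 1.30])

params, _ = curve_fit(shifted_power_law, train_loss, downstream_loss, p0=[1.0, 1.0, 0.5])
K, kappa, E = params
print(f"fitted: L_down ≈ {K:.2f} * L_train^{kappa:.2f} + {E:.2f}")
```

Comparing such fits across configurations (different data, tokenizers, architectures, sizes) is one way to ask which factors actually shift the loss-to-loss trend.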