Tian Jin
@tjin.bsky.social
PhD student @MIT CSAIL
Our results validate this unified scaling law across 30 LLM pre-training runs, spanning models from 58M to 468M parameters with up to 80% final sparsity, trained on up to 20x the Chinchilla-optimal token budget. 6/N
April 21, 2025 at 7:15 AM
Building on this observation, we extend the Chinchilla scaling law to account for both sparse and dense pre-training by using the average param count during pre-training as the model size term. 5/N
April 21, 2025 at 7:15 AM
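For readers who want the formula, here is a minimal sketch of what that extension looks like, assuming the standard Chinchilla parameterization (E, A, B, alpha, beta are fitted constants and D is the training-token count; the paper's fitted values are not reproduced here). The average parameter count simply replaces the usual model-size term:

```latex
% Sketch: Chinchilla form with the model-size term replaced by the
% average parameter count \bar{N} over pre-training.
% E, A, B, \alpha, \beta are fitted constants (placeholders here);
% D is the number of training tokens, N_t the param count at step t.
\[
  L(\bar{N}, D) \;=\; E \;+\; \frac{A}{\bar{N}^{\alpha}} \;+\; \frac{B}{D^{\beta}},
  \qquad
  \bar{N} \;=\; \frac{1}{T}\sum_{t=1}^{T} N_t
\]
```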
Surprisingly, this pair of models achieves matching quality (eval ppl)! We tested this on models with 162M starting parameters and 20-80% final sparsity. The results are consistent: sparse and dense models with the same average param count reach matching final eval loss. 4/N
April 21, 2025 at 7:15 AM
Consider two parameter count vs. training step curves w/ equal area under the curve, i.e., equal training FLOPs. Solid line = dense pre-training, dashed line = sparse pre-training w/ gradual pruning. While they differ in final param count, they match in average param count. 3/N
April 21, 2025 at 7:15 AM
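A tiny numerical sketch of that picture, using a hypothetical cubic pruning schedule and illustrative sizes (the paper's actual schedule and settings may differ): the dense run is sized so both curves share the same average param count, yet their final param counts diverge.

```python
import numpy as np

# Illustrative sketch (not necessarily the paper's schedule): a sparse run
# that is gradually pruned vs. a dense run sized so that both have the same
# average parameter count, i.e. equal area under the param-count-vs-step curve.
T = 10_000                          # training steps (illustrative)
steps = np.arange(T)

n_start = 162e6                     # starting parameter count
final_sparsity = 0.5                # 50% of weights pruned by the end

# Cubic sparsity ramp, a common choice for gradual pruning.
sparsity = final_sparsity * (steps / (T - 1)) ** 3
sparse_params = n_start * (1.0 - sparsity)

# Size the dense model to match the sparse run's average parameter count.
dense_params = np.full(T, sparse_params.mean())

print(f"average params - sparse: {sparse_params.mean():,.0f}, dense: {dense_params.mean():,.0f}")
print(f"final params   - sparse: {sparse_params[-1]:,.0f}, dense: {dense_params[-1]:,.0f}")
```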
📣 The Journey Matters: Our #ICLR2025 paper shows how to pretrain sparse LLMs at half the size of dense LLMs while maintaining quality. We found that the average parameter count during sparse pre-training, not the final size, is what predicts quality. An MIT/Rice/Google/ISTA collab 🧵 1/N
April 21, 2025 at 7:15 AM
The quality-speedup trade-off keeps improving with more training, showing no signs of saturation! We took 4 snapshots at different points of preference optimization (10% Round 1, 100% R1, 10% R2, 60% R2). As we train more, this trade-off improves toward the optimal top-right corner. 11/N
February 27, 2025 at 12:38 AM
We show that PASTA Pareto-dominates all existing async decoding methods! We achieve geometric mean speedups ranging from 1.21× to 1.93× with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against the sequential decoding baseline. 10/N
February 27, 2025 at 12:38 AM
Excited to share our work with friends from MIT/Google on Learned Asynchronous Decoding! LLM responses often contain chunks of tokens that are semantically independent. What if we could train LLMs to identify such chunks and decode them in parallel, thereby speeding up inference? 1/N
February 27, 2025 at 12:38 AM
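To make the idea concrete, here is a minimal conceptual sketch, not the PASTA implementation: decode_chunk is a placeholder for a real per-chunk decoding loop, and the chunk plans are hypothetical. It only illustrates decoding independent chunks in parallel and reassembling them in order.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(prompt: str, chunk_plan: str) -> str:
    """Placeholder for a real decoding call (one decode loop per chunk);
    here it just echoes the plan."""
    return f"<decoded: {chunk_plan}>"

def decode_parallel(prompt: str, chunk_plans: list[str]) -> str:
    # Each semantically independent chunk gets its own decoding call;
    # dependent text would still have to be decoded sequentially.
    with ThreadPoolExecutor(max_workers=len(chunk_plans)) as pool:
        chunks = list(pool.map(lambda plan: decode_chunk(prompt, plan), chunk_plans))
    return " ".join(chunks)  # reassemble in the original order

print(decode_parallel("List three sorting algorithms.",
                      ["describe quicksort", "describe mergesort", "describe heapsort"]))
```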