The share of models disclosing their training data fell from ~79% (2022) to ~39% (2025). And for the first time, open-weight releases outnumber truly open-source ones, raising accountability and reproducibility concerns.
Community repos that quantize, repack, and adapt base models now drive a large fraction of real-world usage.
MLX-Community, SD Concepts Library, LMStudio-Community, and others are consolidating models for deployment and artistic adaptation.
Average downloaded model size rose 17×, alongside 7× growth in MoE adoption and 5× in quantization; multimodal and video downloads grew ~3.4×.
US big tech's (Google/Meta/OpenAI) dominance has dissipated while community/unaffiliated devs surged, and Chinese industry (DeepSeek, Qwen) now commands a major share, hinting at a new consolidation wave among open-weight models.
📄 Paper: dataprovenance.org/economies-of...
We also release a Dashboard: huggingface.co/spaces/econo...
Full paper: arxiv.org/pdf/2510.22037
Huge thanks to my brilliant co-authors: Sneha, Niklas, I-Hung, Isaac, Sandy, Sercan, Chen-Yu, and Sayna!
🌟Answer: We found compute-optimal crossover points for every model size.
Rough rule of thumb: finetune if your compute budget C < 10^10 × N^1.54, otherwise pretrain.
8/
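A minimal sketch of how to apply this rule of thumb. The constant 10^10 and exponent 1.54 are the quoted crossover fit; reading C in FLOPs and N in parameters is my assumption:

# Minimal sketch of the finetune-vs-pretrain rule of thumb above.
# Units (FLOPs, parameter count) are an assumption; the constant 1e10
# and exponent 1.54 are the quoted crossover fit.
def recommend_strategy(compute_budget_flops: float, n_params: float) -> str:
    """Return 'finetune' below the compute-optimal crossover, else 'pretrain'."""
    crossover = 1e10 * n_params ** 1.54
    return "finetune" if compute_budget_flops < crossover else "pretrain"

# Example: a 1B-parameter model with a 1e21-FLOP budget.
# Crossover = 1e10 * (1e9)**1.54 ≈ 7e23 FLOPs, so this budget says: finetune.
print(recommend_strategy(1e21, 1e9))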
The curse is real but quantifiable: ϕ=0.11 (capacity penalty), ψ=-0.04 (data benefit from transfer).
7/
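For intuition, here is one common way such exponents enter a multilingual scaling law. This functional form is an illustrative assumption on my part, not necessarily ATLAS's exact parameterization:

L(N, D, K) \approx E + A\,(N K^{-\phi})^{-\alpha} + B\,(D K^{-\psi})^{-\beta}

With ϕ=0.11 > 0, effective capacity per language shrinks as the language count K grows (the curse); with ψ=-0.04 < 0, the factor K^{-ψ} = K^{0.04} exceeds 1, so effective data grows with K (the transfer benefit).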
🌟Answer: We derived closed-form equations! To go from K to 4K languages while maintaining performance: scale data by 2.74×, model size by 1.4×.
6/
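A short sketch that back-solves the implied power-law exponents from the 4× figures above and extrapolates; the power-law form beyond the 4× point is my assumption:

import math

# Back-solve exponents from the quoted 4x-languages multipliers:
# data must scale by 2.74x and model size by 1.4x.
p_data = math.log(2.74) / math.log(4)   # ≈ 0.73
p_model = math.log(1.4) / math.log(4)   # ≈ 0.24

def iso_performance_scaling(lang_multiplier: float) -> tuple[float, float]:
    """(data multiplier, model multiplier) needed to hold performance while
    multiplying the number of languages; extrapolation is assumed."""
    return lang_multiplier ** p_data, lang_multiplier ** p_model

print(iso_performance_scaling(4.0))   # (2.74, 1.4), recovering the quoted numbers
print(iso_performance_scaling(16.0))  # ≈ (7.51, 1.96) under the assumed power law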
Languages sharing writing systems (e.g., Latin) show dramatically better transfer (mean: -0.23) vs different scripts (mean: -0.39).
Also important: transfer is often asymmetric—A helping B ≠ B helping A.
5/
🌟Answer: We measure this empirically. We built a 38×38 transfer matrix, or 1,444 language pairs—the largest such resource to date.
We highlight the top 5 most beneficial source languages for each target language.
4/
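A minimal sketch of how such a matrix can be queried for the top-5 sources per target. The random values, variable names, and sign convention (higher score = more beneficial, consistent with the script-sharing means above) are illustrative assumptions, not the paper's released data:

import numpy as np

langs = [f"lang_{i}" for i in range(38)]         # placeholders for the 38 languages
rng = np.random.default_rng(0)
transfer = rng.normal(-0.3, 0.1, size=(38, 38))  # transfer[src, tgt]; note it is
                                                 # asymmetric: [a, b] != [b, a]
np.fill_diagonal(transfer, np.nan)               # exclude self-transfer

def top_sources(tgt: int, k: int = 5) -> list[str]:
    """Top-k most beneficial source languages for a given target language."""
    scores = transfer[:, tgt]
    ranked = [i for i in np.argsort(scores)[::-1] if not np.isnan(scores[i])]
    return [langs[i] for i in ranked[:k]]

print(top_sources(0))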
Without modeling transfer, existing scaling laws fail in multilingual settings.
3/
🌟Answer: Yes! ATLAS outperforms prior work with R²(N)=0.88 vs 0.68, and R²(M)=0.82 vs 0.69 for mixture generalization.
2/
He has done some of the best research on fine-grained, scalable, and human-aligned LLM-as-a-judge evaluation.
➡️ Flask
➡️ Prometheus 1 & 2
➡️ Multilingual Prometheus
➡️ KMMLU
➡️ BigGen Bench