Shayne Longpre
@shaynelongpre.bsky.social
PhD @ MIT. Prev: Google DeepMind, Apple, Stanford. 🇨🇦 Interests: AI/ML/NLP, Data-centric AI, transparency & societal impact
Huge thanks to the coauthors who built this with me: @frimelle.bsky.social, @cakiki.bsky.social, Campbell Lund, Atharva Kulkarni, Emily Chen, Irene Solaiman, @evijit.io, and @yjernite.bsky.social 🙏
November 26, 2025 at 4:03 PM
Why now: Open models are more capable, more global, and more strategic. Governance, competition policy, and research norms need accurate, transparent ecosystem analysis.
November 26, 2025 at 4:03 PM
We (mainly Emily Chen) built a live dashboard to monitor concentration, participation, and technical trends, so policy, research, and industry can stay evidence-driven.

huggingface.co/spaces/econo...
November 26, 2025 at 4:03 PM
Finding 4: Transparency is slipping.

The share of models disclosing their training data fell from ~79% (2022) to ~39% (2025). And for the first time, open-weights > truly open-source, raising accountability and reproducibility concerns.
November 26, 2025 at 4:03 PM
Finding 3: A new intermediary dev layer is emerging:

Community repos that quantize, repack, and adapt base models now move a large fraction of real-world usage.

MLX-Community, SD Concepts Library, LMStudio-Community, and others are consolidating models for deployment and artistic adaptation.
November 26, 2025 at 4:03 PM
Finding 2: Models are getting larger, more multimodal, and more efficient.

Average downloaded model size rose 17×, MoE adoption grew 7×, and quantization adoption grew 5×; multimodal & video downloads grew ~3.4×.
November 26, 2025 at 4:03 PM
Finding 1: Power is rebalancing.

US big-tech (Google/Meta/OpenAI) dominance has dissipated, community/unaffiliated devs have surged, and Chinese industry (DeepSeek, Qwen) now commands a major share, hinting at a new consolidation wave among open-weight models.
November 26, 2025 at 4:03 PM
Summary: We traced the complete, de-duplicated history of weekly downloads on Hugging Face and aligned it with model metadata to track concentration, participation, and model trends.

📄 Paper: dataprovenance.org/economies-of...

We also release a Dashboard: huggingface.co/spaces/econo...
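For readers who want to poke at the raw numbers themselves, here is a minimal sketch (not the paper's pipeline, which reconstructs de-duplicated weekly download histories) that pulls current download counts and metadata via the huggingface_hub API and tallies them by organization; the 1,000-model sample size is an arbitrary choice for illustration.

```python
# Minimal sketch (not the paper's pipeline): pull current download counts and
# basic metadata for popular models via the huggingface_hub API, then look at
# how concentrated downloads are across organizations.
from collections import Counter

from huggingface_hub import HfApi  # pip install huggingface_hub

api = HfApi()
# Current download counts for the 1,000 most-downloaded models; the paper
# instead traces the full, de-duplicated weekly download history.
models = api.list_models(sort="downloads", direction=-1, limit=1000)

org_downloads = Counter()
for m in models:
    # Namespaced repos look like "org/model"; legacy repos have no org prefix.
    org = m.id.split("/")[0] if "/" in m.id else "(no org prefix)"
    org_downloads[org] += m.downloads or 0

total = sum(org_downloads.values())
for org, dl in org_downloads.most_common(10):
    print(f"{org:30s} {dl:>14,d}  ({dl / total:.1%} of sampled downloads)")
```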
November 26, 2025 at 4:03 PM
This work provides the scientific foundation for democratizing scaling laws beyond English.

Full paper: arxiv.org/pdf/2510.22037

Huge thanks to my brilliant co-authors: Sneha, Niklas, I-Hung, Isaac, Sandy, Sercan, Chen-Yu, and Sayna!
October 28, 2025 at 2:03 PM
Q4: When should you pretrain from scratch vs finetune a multilingual checkpoint?

🌟Answer: We found compute-optimal crossover points for every model size.

Rough rule of thumb: finetune if your compute budget C < 10^10 × N^1.54 (N = model size in parameters); otherwise pretrain from scratch.
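A quick worked example applying this rule of thumb: the constant 10^10 and exponent 1.54 come from the post above, while the model size and FLOP budgets below are hypothetical.

```python
# Worked example of the rule of thumb quoted above:
# finetune if C < 1e10 * N**1.54, otherwise pretrain from scratch.
# N = model size in parameters, C = compute budget in FLOPs
# (the example values below are hypothetical).

def pretrain_or_finetune(n_params: float, compute_flops: float) -> str:
    crossover = 1e10 * n_params ** 1.54
    return "finetune" if compute_flops < crossover else "pretrain"

n = 1e9                                            # a 1B-parameter model
print(f"crossover ≈ {1e10 * n ** 1.54:.2e} FLOPs") # ≈ 7.2e23 FLOPs
print(pretrain_or_finetune(n, compute_flops=1e22)) # -> finetune
print(pretrain_or_finetune(n, compute_flops=1e25)) # -> pretrain
```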

8/
October 28, 2025 at 2:03 PM
Remarkably, this means 32% less data per language due to positive cross-lingual transfer—but you still need more total compute.

The curse is real but quantifiable: ϕ=0.11 (capacity penalty), ψ=-0.04 (data benefit from transfer).

7/
October 28, 2025 at 2:03 PM
Q3: How much do you need to scale when adding languages? (The "curse of multilinguality")

🌟Answer: We derived closed-form equations! To go from K to 4K languages while maintaining performance: scale data by 2.74×, model size by 1.4×.
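A back-of-envelope sketch, not the paper's closed-form equations: it assumes (my assumption) that the quoted 4×-languages factors extrapolate as a power law in the language-count ratio, which lets you eyeball other ratios.

```python
import math

# Back-of-envelope sketch, NOT the paper's closed-form equations.
# Assumption (mine): the quoted factors for a 4x increase in languages
# (2.74x data, 1.4x model size) extrapolate as a power law in the
# language-count ratio r = L_new / L_old.

def scaling_factors(r: float) -> tuple[float, float]:
    """Return (data_factor, model_factor) needed to maintain performance
    when multiplying the number of languages by r, under that assumption."""
    steps = math.log(r, 4)           # how many 'x4' steps the ratio spans
    return 2.74 ** steps, 1.4 ** steps

for r in (2, 4, 16):
    d, n = scaling_factors(r)
    print(f"{r:>2}x languages -> data x{d:.2f}, model size x{n:.2f}")
# r=4 reproduces the quoted 2.74x / 1.40x; r=16 gives ~7.5x data, ~2.0x model.
```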

6/
October 28, 2025 at 2:03 PM
🌟Key insight:🌟 shared script beats shared language family for positive transfer!

Languages sharing writing systems (e.g., Latin) show dramatically better transfer (mean: -0.23) vs different scripts (mean: -0.39).

Also important: transfer is often asymmetric—A helping B ≠ B helping A.

5/
October 28, 2025 at 2:03 PM
Q2: Which languages actually help each other during training? And how much?

🌟Answer: We measure this empirically. We built a 38×38 transfer matrix, or 1,444 language pairs—the largest such resource to date.

We highlight the top 5 most beneficial source languages for each target language.
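To make the shape of this resource concrete, here is an illustrative sketch of storing and querying such a transfer matrix; the language codes and scores are placeholders, not values from the paper.

```python
import numpy as np

# Illustrative sketch only: the language list and scores below are placeholder
# values, not the paper's 38x38 matrix.
langs = ["en", "fr", "de", "sw", "hi"]     # stand-in for the 38 languages
rng = np.random.default_rng(0)
transfer = rng.normal(-0.3, 0.1, size=(len(langs), len(langs)))  # fake scores
# Note: transfer[s, t] != transfer[t, s] in general (transfer is asymmetric).

def top_sources(target: str, k: int = 5) -> list[tuple[str, float]]:
    """Most beneficial source languages for a target (higher score = better)."""
    t = langs.index(target)
    scores = [(langs[s], transfer[s, t]) for s in range(len(langs)) if s != t]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]

print(top_sources("sw", k=3))
```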

4/
October 28, 2025 at 2:03 PM
ATLAS models cross-lingual transfer explicitly: separating (1) target language data, (2) beneficial transfer languages, and (3) other languages.

Without modeling transfer, existing scaling laws fail in multilingual settings.
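A minimal sketch of the bookkeeping behind that three-way split, assuming a hypothetical token mixture and transfer-language set; this is not the ATLAS functional form itself.

```python
# Minimal sketch of the three-way split described above. The token counts and
# the set of "beneficial transfer languages" are hypothetical, and this is only
# the data bookkeeping, not the ATLAS scaling-law equation.

mixture_tokens = {"en": 500e9, "fr": 80e9, "de": 60e9, "sw": 5e9, "hi": 40e9}
target = "sw"
transfer_langs = {"en", "fr"}   # hypothetical: languages assumed to help Swahili

d_target = mixture_tokens[target]
d_transfer = sum(v for k, v in mixture_tokens.items() if k in transfer_langs)
d_other = sum(v for k, v in mixture_tokens.items()
              if k != target and k not in transfer_langs)

print(f"D_target={d_target:.1e}, D_transfer={d_transfer:.1e}, D_other={d_other:.1e}")
```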

3/
October 28, 2025 at 2:03 PM
Q1: Can we build a scaling law that generalizes to unseen model sizes (N), data amounts (D), AND language mixtures (M)?

🌟Answer: Yes! ATLAS outperforms prior work with R²(N)=0.88 vs 0.68, and R²(M)=0.82 vs 0.69 for mixture generalization.

2/
October 28, 2025 at 2:03 PM
Good question. @scasper.bsky.social would know best?
October 21, 2025 at 4:11 PM
@seungonekim.bsky.social, who led the effort, is one of the best young AI researchers I’ve ever worked with.

He has done some of the best research on fine-grained, scalable, and human-aligned LLM-as-a-judge evaluation.

➡️ FLASK
➡️ Prometheus 1 & 2
➡️ Multilingual Prometheus
➡️ KMMLU
➡️ BigGen Bench
May 6, 2025 at 1:50 PM