Will Held
williamheld.com
@williamheld.com
Modeling Linguistic Variation to expand ownership of NLP tools

Views my own, but affiliations that might influence them:
ML PhD Student under Prof. Diyi Yang
2x RS Intern🦙 Pretraining
Alum NYU Abu Dhabi
Burqueño
he/him
"GPT-5 shows scaling laws are coming to an end"
August 11, 2025 at 5:46 PM
The SALT Lab is at #ACL2025 with our genius leader @diyiyang.bsky.social.

Come see work from
@yanzhe.bsky.social,
@dorazhao.bsky.social, @oshaikh.bsky.social,
@michaelryan207.bsky.social, and myself at any of the talks and posters below!
July 28, 2025 at 7:45 AM
It seems (at a minimum) like they post-trained on the virulently racist content from this thread. Musk framed this as a request for training data... and the top post is eugenics. Seems unlikely to be coincidence that the post uses the same phrasing as the prompt they later removed...
July 10, 2025 at 5:20 AM
In our most similar setting to the original work (130M model), we don't see AdamC's benefits, but:

- We use a smaller WD (0.01) identified from sweeps vs. what is used in the paper (0.05).
- We only train to Chinchilla optimal (2B tokens), whereas the original paper was at 200B.
July 3, 2025 at 3:15 PM
We see the same pattern at 300M and 500M!

Remember, everything else in these experiments is held constant by Levanter & Marin (data order, model init., etc.)

Experiment files here: github.com/marin-commun...
July 3, 2025 at 3:15 PM
As a side note, Kaiyue Wen found that weight decay also causes a slower loss decrease at the start of training in wandb.ai/marin-commun...

Similar to the end of training, this is likely because LR warmup also impacts the LR/WD ratio.

AdamC seems to mitigate this too.
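(To make the ratio point concrete, here's a toy calculation of my own, not code from the paper or from Marin: with AdamW-style decoupled decay, each step shrinks the weights by roughly lr_t * wd, so the effective decay strength follows the LR schedule through both warmup and the final decay.)

```python
# Toy illustration (made-up schedule numbers, not AdamC's implementation):
# with AdamW-style decoupled decay, each step shrinks the weights by ~lr_t * wd,
# so the effective decay strength tracks the LR schedule through warmup and decay.
import math

def lr_at(step: int, total: int, peak_lr: float, warmup: int) -> float:
    if step < warmup:  # linear warmup
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))  # cosine decay

wd, peak_lr, total, warmup = 0.01, 3e-3, 10_000, 1_000
for step in (0, 500, 1_000, 5_000, 9_900):
    lr = lr_at(step, total, peak_lr, warmup)
    print(f"step {step:>6}: lr={lr:.2e}  effective decay lr*wd={lr * wd:.2e}")
```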
July 3, 2025 at 3:15 PM
TL;DR: At 3/4 of our scales, we find the AdamC results reproduce out of the box!

When compared to AdamW with all other factors held constant, AdamC mitigates the rise in gradient norms at the end of training and leads to an overall lower loss (-0.04)!
July 3, 2025 at 3:15 PM
Based on current administration policies, China is about to have an influx of returning talent and an accelerated advantage in research investments.

You need to be both sinophobic and irrational to expect the US to continue as the global scientific powerhouse with these policy own-goals.
June 2, 2025 at 2:59 AM
Marin repurposes GitHub, which has been successful for open-source *software*, for AI:
1. Preregister an experiment as a GitHub issue
2. Submit a PR, which implements the experiment in code
3. PR is reviewed by experts in the community
4. Watch the execution of the experiment live!
May 19, 2025 at 7:06 PM
How much faster would the science of large-scale AI advance if we could open-source the *process* of building a frontier model?

Not just the final models/code/data, but also negative results, toy experiments, and even spontaneous discussions.

That's what we're trying @ marin.community
May 19, 2025 at 7:05 PM
It feels worth it for conference organizers to run a study to see whether this significantly impacts reviewer scores.

I hope things like this are placebos, but if not we need to seriously consider whether existing peer-review processes for big ML conferences are providing value.
May 15, 2025 at 6:19 PM
Results?

We tested
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro

We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence.

No model wins everything. (3/5)
May 7, 2025 at 4:15 PM
The Model Context Protocol is cool because it gives external developers a way to add meaningful functionality on top of LLM platforms.

To limit-test this, I made a "Realtime Voice" MCP using free STT, VAD, and TTS systems. The result is janky, but it makes me excited about the ecosystem to come!
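(For the curious, a minimal sketch of the shape such a server takes with the official MCP Python SDK's FastMCP; the listen/speak bodies below are placeholders, not the actual STT/VAD/TTS stack from this post.)

```python
# Minimal sketch of a "voice" MCP server using the official Python SDK (pip install mcp).
# The STT/VAD/TTS internals are placeholders, not the implementation from this post.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("realtime-voice")

@mcp.tool()
def listen() -> str:
    """Record from the microphone until the speaker stops, then return a transcript."""
    # Placeholder: a real server would run VAD to detect end-of-speech,
    # then hand the captured audio to a free STT model.
    return "<transcript of what the user just said>"

@mcp.tool()
def speak(text: str) -> str:
    """Synthesize `text` with a local TTS engine and play it aloud."""
    # Placeholder: swap in any free TTS system here.
    return f"spoke: {text}"

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio so an LLM client can call them
```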
April 10, 2025 at 10:03 PM
Update: Gemini 2.0 Flash now supported in Talk Arena!

Come try the new Gemini and determine how strong it is at Speech & Audio compared to DiVA Llama 3, Qwen 2 Audio, and GPT 4o Advanced Voice at talkarena.org
December 17, 2024 at 9:13 PM
Testing models on 18 commonly used static evaluation benchmarks, we find that none produce the same rankings as our interactive user evaluation.

This suggests common interaction areas might be missing in existing static benchmarks used for Large Audio Models! (3/5)
December 10, 2024 at 12:01 AM
Before releasing Talk Arena, we collected votes from over 350 paid participants on Prolific comparing five popular models’ text responses.

The initial standings show 🏅DiVA, 🥈GPT4o, 🥉Gemini, 4️⃣ Qwen2 Audio, 5️⃣ Typhoon. (2/5)
December 10, 2024 at 12:01 AM
With an increasing number of Large *Audio* Models 🔊, which one do users like the most?

Introducing talkarena.org — an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks.
🧵 (1/5)
December 10, 2024 at 12:01 AM
RIP to the first product I got to do real "Big Data" work on 🫡🫡🫡
December 6, 2024 at 6:39 PM
I believe you are thinking of the Llama 3.2 models, which I think are not covered in the paper but are pruned and then refined with distillation!

huggingface.co/meta-llama/L...
December 5, 2024 at 5:29 PM
Trying to break into a closet, realizing he's been caught, and feigning innocence!!
November 17, 2024 at 10:29 PM
I'll be at the Google Theory and Practice of Foundation Models Workshop today and tomorrow! FOMO for EMNLP, but excited to chat more casually at a smaller non-archival workshop 😅

I am presenting at the Lightning Talks tomorrow at 1:30 PM on our Distilled Voice Assistant model if you're around!
November 14, 2024 at 9:26 PM
Of course!

How you do sampling and packing is one of those things that matters a lot in practice, but it often gets shoved to appendices because it's not exciting. For example, this non-trivial solution from DeepMind, which isn't referenced in the main text.
November 7, 2024 at 6:15 PM
More recent works take the "don't attend across documents" idea even further and manually mask out attention across documents.

This gives you the compute efficiency without the "it's weird to attend across documents at all" aspect.
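(Purely as an illustration, not from any particular codebase: if you track a per-token document ID in the packed sequence, the cross-document mask is just the causal mask AND a block-diagonal "same document" mask.)

```python
# Sketch: a causal attention mask that also blocks attention across packed documents.
# `doc_ids` marks which document each token in the packed sequence came from.
import torch

def packed_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) int tensor. Returns a (seq_len, seq_len) bool mask,
    True where attention is allowed."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)  # block-diagonal
    return causal & same_doc

# Example: three documents of lengths 3, 2, and 4 packed into one sequence.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2])
mask = packed_causal_mask(doc_ids)  # pass as attn_mask (or convert to an additive -inf bias)
```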
November 7, 2024 at 6:05 PM
Yes! Token packing has been the standard since RoBERTa. Excerpt below!

The intuition is that the model quickly learns not to attend across [SEP] boundaries, and packing avoids "wasting" compute on the padding tokens otherwise needed to bring variable-length sequences to a consistent batch shape.
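(For concreteness, a toy greedy version of the packing step, my simplification rather than RoBERTa's actual code: concatenate tokenized documents with a separator and slice into fixed-length sequences so no position is spent on padding.)

```python
# Toy greedy packing: concatenate tokenized docs with a separator token and
# slice into fixed-length sequences so no compute is spent on padding.
from typing import Iterable, Iterator

def pack(docs: Iterable[list[int]], sep_id: int, seq_len: int) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc + [sep_id])
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # The final partial buffer is dropped here; real pipelines pad or carry it over.

# Example: pack three "documents" into sequences of length 8 with [SEP] id 99.
sequences = list(pack([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], sep_id=99, seq_len=8))
```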
November 7, 2024 at 6:02 PM