Views my own, but affiliations that might influence them:
ML PhD Student under Prof. Diyi Yang
2x RS Intern🦙 Pretraining
Alum NYU Abu Dhabi
Burqueño
he/him
Come see work from
@yanzhe.bsky.social,
@dorazhao.bsky.social, @oshaikh.bsky.social,
@michaelryan207.bsky.social, and myself at any of the talks and posters below!
- We use a smaller WD (0.01), identified from sweeps, vs. the 0.05 used in the paper.
- We only train to Chinchilla optimal (2B tokens), whereas the original paper trained for 200B.
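For concreteness, the delta vs. the paper boils down to this (a rough sketch, not the actual Levanter/Marin config format; only the WD values and token budgets come from the thread):

```python
# Hedged sketch of the two setups being compared; everything else in the
# real configs is held constant.
paper_setup = {"weight_decay": 0.05, "train_tokens": 200_000_000_000}  # original paper
our_setup   = {"weight_decay": 0.01, "train_tokens": 2_000_000_000}    # sweep-tuned WD, ~Chinchilla optimal
```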
Remember, everything else in these experiments is held constant by Levanter & Marin (data order, model init., etc.)
Experiment files here: github.com/marin-commun...
Similar to the end of training, this is likely because LR warmup also impacts the LR/WD ratio.
AdamC seems to mitigate this too.
When compared to AdamW with all other factors held constant, AdamC mitigates the gradient ascent at the end of training and leads to an overall lower loss (-0.04)!
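For anyone curious, here's my rough mental model of the AdamC correction (a sketch of the idea as I understand it, not Levanter's actual optimizer code): scale the weight decay with the current LR so the LR/WD ratio stays constant through warmup and decay.

```python
# Sketch of an AdamC-style corrected weight decay (names are mine, not the paper's).
def corrected_weight_decay(base_wd: float, lr_t: float, lr_max: float) -> float:
    # Plain AdamW shrinks weights by lr_t * base_wd each step, so the balance
    # between decay and gradient updates drifts whenever the schedule warms up
    # or decays. Rescaling WD by lr_t / lr_max keeps that balance roughly fixed.
    return base_wd * (lr_t / lr_max)

# e.g. late in a cosine schedule where lr_t is ~10% of lr_max,
# the effective WD also drops to ~10% of base_wd:
print(corrected_weight_decay(0.01, lr_t=3e-5, lr_max=3e-4))  # -> 0.001
```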
You need to be both sinophobic and irrational to expect the US to continue as the global scientific powerhouse with these policy own-goals.
1. Preregister an experiment as a GitHub issue
2. Submit a PR, which implements the experiment in code (rough sketch below)
3. PR is reviewed by experts in the community
4. Watch the execution of the experiment live!
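Here's a hypothetical sketch of what step 2's experiment file can look like. Names and structure are illustrative only, not Marin's actual API; the real examples live in the marin repo.

```python
# Hypothetical sketch of a preregistered experiment file (illustrative only,
# not Marin's actual API). The point: the PR declares exactly what will run,
# reviewers read it as code, and the executor later replays it verbatim.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    config: dict
    deps: list = field(default_factory=list)

tokenize = Step("tokenize_corpus", config={"dataset": "..."})
train = Step(
    "train_1b_adamc",
    config={"optimizer": "adamc", "weight_decay": 0.01, "train_tokens": 2_000_000_000},
    deps=[tokenize],
)

def run(steps: list) -> None:
    # A real executor would resolve deps, launch jobs, and stream results live.
    for s in steps:
        print(f"running {s.name} (after {[d.name for d in s.deps]})")

if __name__ == "__main__":
    run([tokenize, train])
```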
Not just the final models/code/data, but also negative results, toy experiments, and even spontaneous discussions.
That's what we're trying @ marin.community
I hope things like this are placebos, but if not we need to seriously consider whether existing peer-review processes for big ML conferences are providing value.
We tested
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro
We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence.
No model wins everything. (3/5)
To limit-test this, I made a "Realtime Voice" MCP using free STT, VAD, and TTS systems. The result is janky, but it makes me excited about the ecosystem to come!
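In case it's useful as a starting point, here's roughly the shape of it, assuming the FastMCP helper from the Python MCP SDK; the speech bits are placeholders for whichever free STT/VAD/TTS models you plug in (my actual implementation differs in the details).

```python
# Rough sketch of a "Realtime Voice" MCP server. The speech functions are
# placeholders; a real version would back them with open STT/VAD/TTS models.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("realtime-voice")

@mcp.tool()
def listen() -> str:
    """Record from the mic until VAD detects silence, then return a transcript."""
    # Placeholder: run VAD over mic frames, then pass the captured audio
    # to a free STT model (e.g. an open Whisper variant).
    return "transcribed user speech goes here"

@mcp.tool()
def speak(text: str) -> str:
    """Synthesize `text` with a local TTS model and play it back."""
    # Placeholder: call a free TTS system here.
    return f"spoke: {text}"

if __name__ == "__main__":
    mcp.run()  # the host LLM can now call listen()/speak() as tools
```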
Come try the new Gemini and determine how strong it is at Speech & Audio compared to DiVA Llama 3, Qwen 2 Audio, and GPT-4o Advanced Voice at talkarena.org
This suggests common interaction areas might be missing from existing static benchmarks used for Large Audio Models! (3/5)
The initial standings show 🏅DiVA, 🥈GPT4o, 🥉Gemini, 4️⃣ Qwen2 Audio, 5️⃣ Typhoon. (2/5)
Introducing talkarena.org — an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks.
🧵 (1/5)
huggingface.co/meta-llama/L...
I am presenting at the Lightning Talks tomorrow at 1:30 PM on our Distilled Voice Assistant model if you're around!
How you do sampling and packing is one of those things that matters a lot in practice but often gets shoved to appendices because it's not exciting. For example, this non-trivial solution from DeepMind, which isn't referenced in the main text.
This gives you the compute-efficiency without the "it's weird to attend across documents at all" aspect.
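Concretely, the masked variant can be as simple as packing documents into one sequence and intersecting the causal mask with a per-document segment mask, so tokens only attend within their own document. A minimal sketch (not any particular library's implementation):

```python
import numpy as np

# Minimal sketch of per-document masking for packed sequences.
# `segment_ids` marks which packed document each position belongs to,
# e.g. [0, 0, 0, 1, 1, 2, 2, 2] for three documents packed into one sequence.
def packed_attention_mask(segment_ids: np.ndarray) -> np.ndarray:
    seq_len = segment_ids.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))    # no attending to the future
    same_doc = segment_ids[:, None] == segment_ids[None, :]      # no attending across documents
    return causal & same_doc

mask = packed_attention_mask(np.array([0, 0, 0, 1, 1, 2, 2, 2]))
print(mask.astype(int))  # block-diagonal, lower-triangular within each document
```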
The intuition is that the model quickly learns not to attend across [SEP] boundaries, and packing avoids "wasting" compute on the padding tokens otherwise required to make variable-length batches a consistent size.