Views my own, but affiliations that might influence them:
ML PhD Student under Prof. Diyi Yang
2x RS Intern🦙 Pretraining
Alum NYU Abu Dhabi
Burqueño
he/him
Come see work from
@yanzhe.bsky.social,
@dorazhao.bsky.social, @oshaikh.bsky.social,
@michaelryan207.bsky.social, and myself at any of the talks and posters below!
- We use a smaller WD (0.01), identified from sweeps, vs. the 0.05 used in the paper.
- We only train to Chinchilla optimal (2B tokens), whereas the original paper trained for 200B.
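For concreteness, the delta vs. the paper boils down to this (a rough sketch, not the actual Levanter/Marin config format; only the WD values and token budgets come from the thread):

```python
# Hedged sketch of the two setups being compared; everything else in the
# real configs is held constant.
paper_setup = {"weight_decay": 0.05, "train_tokens": 200_000_000_000}  # original paper
our_setup   = {"weight_decay": 0.01, "train_tokens": 2_000_000_000}    # sweep-tuned WD, ~Chinchilla optimal
```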
Remember, everything else in these experiments is held constant by Levanter & Marin (data order, model init., etc.)
Experiment files here: github.com/marin-commun...
Similar to the end of training, this is likely because LR warmup also impacts the LR/WD ratio.
AdamC seems to mitigate this too.
When compared to AdamW with all other factors held constant, AdamC mitigates the gradient ascent at the end of training and leads to an overall lower loss (-0.04)!
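For anyone curious, here's my rough mental model of the AdamC correction (a sketch of the idea as I understand it, not Levanter's actual optimizer code): scale the weight decay with the current LR so the LR/WD ratio stays constant through warmup and decay.

```python
# Sketch of an AdamC-style corrected weight decay (names are mine, not the paper's).
def corrected_weight_decay(base_wd: float, lr_t: float, lr_max: float) -> float:
    # Plain AdamW shrinks weights by lr_t * base_wd each step, so the balance
    # between decay and gradient updates drifts whenever the schedule warms up
    # or decays. Rescaling WD by lr_t / lr_max keeps that balance roughly fixed.
    return base_wd * (lr_t / lr_max)

# e.g. late in a cosine schedule where lr_t is ~10% of lr_max,
# the effective WD also drops to ~10% of base_wd:
print(corrected_weight_decay(0.01, lr_t=3e-5, lr_max=3e-4))  # -> 0.001
```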
You need to be both sinophobic and irrational to expect the US to continue as the global scientific powerhouse with these policy own-goals.
1. Preregister an experiment as a GitHub issue
2. Submit a PR, which implements the experiment in code (rough sketch below)
3. PR is reviewed by experts in the community
4. Watch the execution of the experiment live!
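Here's a hypothetical sketch of what step 2's experiment file can look like. Names and structure are illustrative only, not Marin's actual API; the real examples live in the marin repo.

```python
# Hypothetical sketch of a preregistered experiment file (illustrative only,
# not Marin's actual API). The point: the PR declares exactly what will run,
# reviewers read it as code, and the executor later replays it verbatim.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    config: dict
    deps: list = field(default_factory=list)

tokenize = Step("tokenize_corpus", config={"dataset": "..."})
train = Step(
    "train_1b_adamc",
    config={"optimizer": "adamc", "weight_decay": 0.01, "train_tokens": 2_000_000_000},
    deps=[tokenize],
)

def run(steps: list) -> None:
    # A real executor would resolve deps, launch jobs, and stream results live.
    for s in steps:
        print(f"running {s.name} (after {[d.name for d in s.deps]})")

if __name__ == "__main__":
    run([tokenize, train])
```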
Not just the final models/code/data, but also negative results, toy experiments, and even spontaneous discussions.
That's what we're trying @ marin.community
I hope things like this are placebos, but if not we need to seriously consider whether existing peer-review processes for big ML conferences are providing value.
We tested
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro
We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence.
No model wins everything. (3/5)
To limit-test this, I made a "Realtime Voice" MCP using free STT, VAD, and TTS systems. The result is janky, but it makes me excited about the ecosystem to come!
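In case it's useful as a starting point, here's roughly the shape of it, assuming the FastMCP helper from the Python MCP SDK; the speech bits are placeholders for whichever free STT/VAD/TTS models you plug in (my actual implementation differs in the details).

```python
# Rough sketch of a "Realtime Voice" MCP server. The speech functions are
# placeholders; a real version would back them with open STT/VAD/TTS models.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("realtime-voice")

@mcp.tool()
def listen() -> str:
    """Record from the mic until VAD detects silence, then return a transcript."""
    # Placeholder: run VAD over mic frames, then pass the captured audio
    # to a free STT model (e.g. an open Whisper variant).
    return "transcribed user speech goes here"

@mcp.tool()
def speak(text: str) -> str:
    """Synthesize `text` with a local TTS model and play it back."""
    # Placeholder: call a free TTS system here.
    return f"spoke: {text}"

if __name__ == "__main__":
    mcp.run()  # the host LLM can now call listen()/speak() as tools
```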
Come try the new Gemini and determine how strong it is at Speech & Audio compared to DiVA Llama 3, Qwen 2 Audio, and GPT-4o Advanced Voice at talkarena.org
This suggests common interaction areas might be missing from existing static benchmarks used for Large Audio Models! (3/5)
The initial standings show 🏅DiVA, 🥈GPT4o, 🥉Gemini, 4️⃣ Qwen2 Audio, 5️⃣ Typhoon. (2/5)
Introducing talkarena.org — an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks.
🧵 (1/5)
huggingface.co/meta-llama/L...
I am presenting at the Lightning Talks tomorrow at 1:30 PM on our Distilled Voice Assistant model if you're around!
How you do sampling and packing is one of those things that matters a lot in practice but often gets shoved to appendices because it's not exciting. For example, this non-trivial solution from DeepMind, which isn't referenced in the main text.
This gives you the compute-efficiency without the "it's weird to attend across documents at all" aspect.
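Concretely, the masked variant can be as simple as packing documents into one sequence and intersecting the causal mask with a per-document segment mask, so tokens only attend within their own document. A minimal sketch (not any particular library's implementation):

```python
import numpy as np

# Minimal sketch of per-document masking for packed sequences.
# `segment_ids` marks which packed document each position belongs to,
# e.g. [0, 0, 0, 1, 1, 2, 2, 2] for three documents packed into one sequence.
def packed_attention_mask(segment_ids: np.ndarray) -> np.ndarray:
    seq_len = segment_ids.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))    # no attending to the future
    same_doc = segment_ids[:, None] == segment_ids[None, :]      # no attending across documents
    return causal & same_doc

mask = packed_attention_mask(np.array([0, 0, 0, 1, 1, 2, 2, 2]))
print(mask.astype(int))  # block-diagonal, lower-triangular within each document
```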
The intuition is that the model quickly learns not to attend across [SEP] boundaries, and packing avoids "wasting" compute on the padding tokens otherwise required to make variable-length batches a consistent size.