Hamish Ivison
@hamishivi.bsky.social
1.2K followers 370 following 62 posts
I (try to) do NLP research. Antipodean abroad. currently doing PhD @uwcse, prev @usyd @ai2 🇦🇺🇨🇦🇬🇧 ivison.id.au
Reposted by Hamish Ivison
ai2.bsky.social
We’re live on Reddit! Ask Us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.
hamishivi.bsky.social
I’ll be around for this! Come ask us questions about olmo and tulu :)
ai2.bsky.social
Have questions? We’re an open book!

We’re excited to host an AMA to answer your Qs about OLMo, our family of open language models.

🗓️ When: May 8, 8-10 am PT
🌐 Where: r/huggingface
🧠 Why: Gain insights from our expert researchers

Chat soon!
hamishivi.bsky.social
Excited to be back home in Australia (Syd/Melb) for most of April! Email or DM if you want to grab a coffee :)
Reposted by Hamish Ivison
natolambert.bsky.social
@vwxyzjn.bsky.social and @hamishivi.bsky.social have uploaded intermediate checkpoints for our recent RL models at Ai2. Folks should do research into how RL finetuning impacts the weights!

Models with them: OLMo 2 7B, 13B, and 32B Instruct; Tulu 3 and 3.1 8B; Tulu 3 405B
hamishivi.bsky.social
8/8 Please check out the paper for many, many more details, including ablations on RDS+, Tulu 3 results, and analysis on what gets selected! Thanks for reading!

Many thanks to my collaborators on this, the dream team of @muruzhang.bsky.social, Faeze Brahman, Pang Wei Koh, and @pdasigi.bsky.social!
hamishivi.bsky.social
6/8 We further investigate RDS+ by selecting up to millions of samples and comparing it to random selection. RDS+ beats random selection at all data sizes, and once total compute is taken into account, it performs significantly better at the larger sizes.
hamishivi.bsky.social
5/8 We also investigate how well these methods work when selecting one dataset for multiple downstream tasks. The best-performing method, RDS+, outperforms the Tulu 2 mixture. We also see strong results when selecting with Arena Hard samples as the query points for RDS+.
hamishivi.bsky.social
4/8 Notably, the best-performing method overall is RDS+ – just using cosine similarity with embeddings produced by pretrained models.

While RDS is a common baseline, with a little tuning we get stronger performance from it than from more costly alternatives such as LESS.
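(For the curious, here is a minimal sketch of what RDS-style selection looks like: embed every candidate with a pretrained encoder, embed the query/task examples the same way, and keep the pool samples with the highest cosine similarity. The encoder name, mean pooling, and max-similarity scoring are illustrative assumptions, not the paper's exact recipe.)

```python
# Minimal sketch of embedding-based data selection (RDS-style).
# Assumptions: mean-pooled hidden states as embeddings and plain
# top-k max-cosine-similarity scoring; the paper's RDS+ recipe
# differs in its details.
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model, tokenizer):
    """Mean-pool the last hidden states to get one unit vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)
    return torch.nn.functional.normalize(pooled, dim=-1)

def select_top_k(pool_texts, query_texts, k,
                 model_name="sentence-transformers/all-MiniLM-L6-v2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    pool_emb = embed(pool_texts, model, tokenizer)    # (N, H)
    query_emb = embed(query_texts, model, tokenizer)  # (Q, H)
    # Score each pool sample by its best cosine similarity to any
    # query point, then keep the k highest-scoring samples.
    scores = (pool_emb @ query_emb.T).max(dim=1).values
    return scores.topk(k).indices.tolist()
```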
hamishivi.bsky.social
3/8 We test a variety of data selection methods on these pools.

We select 10k samples from a downsampled pool of 200k samples, and then test selecting 10k samples from all 5.8M samples. Surprisingly, many methods drop in performance when the pool size increases!
hamishivi.bsky.social
2/8 We begin by constructing data pools for selection, using Tulu 2/3 as a starting point. These pools contain over 4 million samples – all data initially considered for the Tulu models. We then perform selection and evaluation across seven different downstream tasks.
hamishivi.bsky.social
How well do data-selection methods work for instruction-tuning at scale?

Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best!

More below ⬇️ (1/8)
hamishivi.bsky.social
(8/8) This project was co-led with Jake Tae, with great advice from @armancohan.bsky.social and @shocheen.bsky.social. We also build off the prior TESS work (aclanthology.org/2024.eacl-lo...), to which we're much indebted. Thanks for reading!
hamishivi.bsky.social
(6/8) Second, using classifier guidance with an off-the-shelf reward model (which we call reward guidance). Increasing the weight of the RM guidance improves AlpacaEval winrate. If you set the guidance really high, you get high-reward but nonsensical generations (reward-hacking!).
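(Mechanically, classifier guidance here means nudging each denoising prediction along the reward model's gradient. A minimal sketch under loose assumptions; `predict_x0`, the reward model interface, and the update rule are illustrative, not the paper's exact formulation:)

```python
# Sketch of reward-guided denoising for a continuous diffusion LM.
# `diffusion_model.predict_x0` and `reward_model` are hypothetical
# interfaces standing in for the real implementation.
import torch

def guided_denoise_step(x_t, t, diffusion_model, reward_model, guidance_weight=1.0):
    # Ordinary denoising prediction from the diffusion LM.
    x0_pred = diffusion_model.predict_x0(x_t, t)

    # Classifier (reward) guidance: take a gradient step on the prediction
    # in the direction that increases the reward model's score.
    x0_pred = x0_pred.detach().requires_grad_(True)
    reward = reward_model(x0_pred).sum()
    grad = torch.autograd.grad(reward, x0_pred)[0]

    # Larger guidance_weight pushes generations toward higher reward;
    # set it too high and you get high-reward but nonsensical text.
    return (x0_pred + guidance_weight * grad).detach()
```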
hamishivi.bsky.social
(5/8) First, as we increase diffusion steps, we see GSM8k scores improve consistently! We also see AlpacaEval improve and then degrade, as the model generations get more repetitive.
hamishivi.bsky.social
(4/8) We also further improve performance without additional training in two key ways:
(1) Using more diffusion steps
(2) Using reward guidance
Explained below 👇
hamishivi.bsky.social
(3/8) We train TESS 2 by (1) performing 200k steps of diffusion adaptation training, then (2) instruction tuning on Tulu data. We found that adapting Mistral models (v0.1/0.3) worked much better than adapting Llama!
hamishivi.bsky.social
(2/8) We find that TESS 2 performs well in QA, but lags in reasoning-heavy tasks (GSM8k, BBH). However, when we train on GSM8k-specific data, we beat AR models!
It may be that instruction-tuning mixtures need to be adjusted for diffusion models (we just used Tulu 2/3 off the shelf).
hamishivi.bsky.social
(1/8) Excited to share some new work: TESS 2!
TESS 2 is an instruction-tuned diffusion LM, trained by adapting an existing pretrained AR model, that performs close to its AR counterparts on general QA tasks.
📜 Paper: arxiv.org/abs/2502.13917
🤖 Demo: huggingface.co/spaces/hamis...

More below ⬇️
hamishivi.bsky.social
GRPO makes everything better 😌
vwxyzjn.bsky.social
🔥 allenai/Llama-3.1-Tulu-3-8B (trained with PPO) -> allenai/Llama-3.1-Tulu-3.1-8B (trained with GRPO)

We are happy to "quietly" release our latest GRPO-trained Tulu 3.1 model, which is considerably better in MATH and GSM8K!
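(For reference, the key mechanical change is the advantage estimate: GRPO drops PPO's learned value function and instead baselines each completion against the other completions sampled for the same prompt. A minimal sketch of the group-relative advantage, with illustrative names:)

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages, the core of GRPO.

    rewards: (num_prompts, group_size) tensor of scalar rewards, one row
    per prompt and one column per completion sampled for that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to its siblings, so no value
    # network is needed to estimate a baseline.
    return (rewards - mean) / (std + eps)
```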
Reposted by Hamish Ivison
ai2.bsky.social · Feb 11
We took our most efficient model and made an open-source iOS app 📱 but why?

As phones get faster, more AI will happen on device. With OLMoE, researchers, developers, and users can get a feel for this future: fully private LLMs, available anytime.

Learn more from @soldaini.net👇 youtu.be/rEK_FZE5rqQ
Ai2 OLMoE: Fully open source, running entirely on-device
hamishivi.bsky.social
This was a fun side effort with lots of help from everyone on the Tulu 3 team. Special shoutouts to @vwxyzjn.bsky.social (who did a lot on the training+infra side) and @ljvmiranda.bsky.social (who helped with DPO data generation). I leave you with the *unofficial* name for this release:
hamishivi.bsky.social
This significant improvement in MATH was particularly surprising to me, considering that I tried training on just MATH at the 8B scale and it only yielded improvements after hundreds of RL steps! It may be that larger (or higher-quality?) base models make RLVR with more difficult data easier.
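(Context on RLVR, i.e. RL with verifiable rewards: instead of a learned reward model, completions are scored with a programmatic check of the final answer. A toy verifier, assuming GSM8k-style '#### answer' formatting; the regex and reward values are illustrative:)

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the completion's final answer matches the
    ground truth, else 0.0. Real verifiers normalize answers more carefully."""
    match = re.search(r"####\s*(-?[\d,.]+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).replace(",", "")
    return 1.0 if predicted == gold_answer.strip() else 0.0
```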