Hamish Ivison
@hamishivi.bsky.social
1.2K followers 370 following 62 posts
I (try to) do NLP research. Antipodean abroad. currently doing PhD @uwcse, prev @usyd @ai2 🇦🇺🇨🇦🇬🇧 ivison.id.au
Reposted by Hamish Ivison
ai2.bsky.social
We’re live on Reddit! Ask Us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.
hamishivi.bsky.social
I’ll be around for this! Come ask us questions about olmo and tulu :)
ai2.bsky.social
Have questions? We’re an open book!

We’re excited to host an AMA to answer your Qs about OLMo, our family of open language models.

🗓️ When: May 8, 8-10 am PT
🌐 Where: r/huggingface
🧠 Why: Gain insights from our expert researchers

Chat soon!
hamishivi.bsky.social
Excited to be back home in Australia (Syd/Melb) for most of April! Email or DM if you want to grab a coffee :)
Reposted by Hamish Ivison
natolambert.bsky.social
@vwxyzjn.bsky.social and @hamishivi.bsky.social have uploaded intermediate checkpoints for our recent RL models at Ai2. Folks should do research into how RL finetuning impacts the weights!

Models with them: OLMo 2 7B, 13B, and 32B Instruct; Tulu 3 and 3.1 8B; Tulu 3 405B
hamishivi.bsky.social
8/8 Please check out the paper for many, many more details, including ablations on RDS+, Tulu 3 results, and analysis on what gets selected! Thanks for reading!

Many thanks to my collaborators on this, the dream team of @muruzhang.bsky.social, Faeze Brahman, Pang Wei Koh, and @pdasigi.bsky.social!
hamishivi.bsky.social
6/8 We further investigate RDS+ by selecting up to millions of samples and comparing it to random selection. RDS+ beats random selection at all data sizes, and once total compute is taken into account, it performs significantly better at the larger sizes.
hamishivi.bsky.social
5/8 We also investigate how well these methods work when selecting one dataset for multiple downstream tasks. The best-performing method, RDS+, outperforms the Tulu 2 mixture. We also see strong results when selecting with Arena Hard samples as the query points for RDS+.
hamishivi.bsky.social
4/8 Notably, the best-performing method overall is RDS+ – just using cosine similarity with embeddings produced by pretrained models.

While RDS is a common baseline, with a little tuning we get stronger performance from it than from more costly alternatives such as LESS.
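(For the curious, here is a minimal sketch of what RDS-style selection looks like: embed every candidate with a pretrained encoder, embed the query/task examples the same way, and keep the pool samples with the highest cosine similarity. The encoder name, mean pooling, and max-similarity scoring are illustrative assumptions, not the paper's exact recipe.)

```python
# Minimal sketch of embedding-based data selection (RDS-style).
# Assumptions: mean-pooled hidden states as embeddings and plain
# top-k max-cosine-similarity scoring; the paper's RDS+ recipe
# differs in its details.
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model, tokenizer):
    """Mean-pool the last hidden states to get one unit vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)
    return torch.nn.functional.normalize(pooled, dim=-1)

def select_top_k(pool_texts, query_texts, k,
                 model_name="sentence-transformers/all-MiniLM-L6-v2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    pool_emb = embed(pool_texts, model, tokenizer)    # (N, H)
    query_emb = embed(query_texts, model, tokenizer)  # (Q, H)
    # Score each pool sample by its best cosine similarity to any
    # query point, then keep the k highest-scoring samples.
    scores = (pool_emb @ query_emb.T).max(dim=1).values
    return scores.topk(k).indices.tolist()
```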
hamishivi.bsky.social
3/8 We test a variety of data selection methods on these pools.

We select 10k samples from a downsampled pool of 200k samples, and then test selecting 10k samples from all 5.8M samples. Surprisingly, many methods drop in performance when the pool size increases!
hamishivi.bsky.social
2/8 We begin by constructing data pools for selection, using Tulu 2/3 as a starting point. These pools contain over 4 million samples – all data initially considered for the Tulu models. We then perform selection and evaluation across seven different downstream tasks.
hamishivi.bsky.social
How well do data-selection methods work for instruction-tuning at scale?

Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best!

More below ⬇️ (1/8)
hamishivi.bsky.social
(8/8) This project was co-led with Jake Tae, with great advice from @armancohan.bsky.social and @shocheen.bsky.social. We also build off the prior TESS work (aclanthology.org/2024.eacl-lo...), to which we're much indebted. Thanks for reading!
hamishivi.bsky.social
(6/8) Second, using classifier guidance with an off-the-shelf reward model (which we call reward guidance). Increasing the weight of the RM guidance improves AlpacaEval winrate. If you set the guidance really high, you get high-reward but nonsensical generations (reward-hacking!).
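(Mechanically, classifier guidance here means nudging each denoising prediction along the reward model's gradient. A minimal sketch under loose assumptions; `predict_x0`, the reward model interface, and the update rule are illustrative, not the paper's exact formulation:)

```python
# Sketch of reward-guided denoising for a continuous diffusion LM.
# `diffusion_model.predict_x0` and `reward_model` are hypothetical
# interfaces standing in for the real implementation.
import torch

def guided_denoise_step(x_t, t, diffusion_model, reward_model, guidance_weight=1.0):
    # Ordinary denoising prediction from the diffusion LM.
    x0_pred = diffusion_model.predict_x0(x_t, t)

    # Classifier (reward) guidance: take a gradient step on the prediction
    # in the direction that increases the reward model's score.
    x0_pred = x0_pred.detach().requires_grad_(True)
    reward = reward_model(x0_pred).sum()
    grad = torch.autograd.grad(reward, x0_pred)[0]

    # Larger guidance_weight pushes generations toward higher reward;
    # set it too high and you get high-reward but nonsensical text.
    return (x0_pred + guidance_weight * grad).detach()
```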
hamishivi.bsky.social
(5/8) First, as we increase diffusion steps, we see GSM8k scores improve consistently! We also see AlpacaEval improve and then degrade, as the model generations get more repetitive.
hamishivi.bsky.social
(4/8) We also further improve performance without additional training in two key ways:
(1) Using more diffusion steps
(2) Using reward guidance
Explained below 👇
hamishivi.bsky.social
(3/8) We train TESS 2 by (1) performing 200k steps of diffusion adaptation training, then (2) instruction tuning on Tulu data. We found that adapting Mistral models (v0.1/0.3) worked much better than adapting Llama!
hamishivi.bsky.social
(2/8) We find that TESS 2 performs well in QA, but lags in reasoning-heavy tasks (GSM8k, BBH). However, when we train on GSM8k-specific data, we beat AR models!
It may be that instruction-tuning mixtures need to be adjusted for diffusion models (we just used Tulu 2/3 off the shelf).
hamishivi.bsky.social
(1/8) Excited to share some new work: TESS 2!
TESS 2 is an instruction-tuned diffusion LM, trained by adapting an existing pretrained AR model, that performs close to its AR counterparts on general QA tasks.
📜 Paper: arxiv.org/abs/2502.13917
🤖 Demo: huggingface.co/spaces/hamis...

More below ⬇️
hamishivi.bsky.social
GRPO makes everything better 😌
vwxyzjn.bsky.social
🔥 allenai/Llama-3.1-Tulu-3-8B (trained with PPO) -> allenai/Llama-3.1-Tulu-3.1-8B (trained with GRPO)

We are happy to "quietly" release our latest GRPO-trained Tulu 3.1 model, which is considerably better in MATH and GSM8K!
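(For reference, the key mechanical change is the advantage estimate: GRPO drops PPO's learned value function and instead baselines each completion against the other completions sampled for the same prompt. A minimal sketch of the group-relative advantage, with illustrative names:)

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages, the core of GRPO.

    rewards: (num_prompts, group_size) tensor of scalar rewards, one row
    per prompt and one column per completion sampled for that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to its siblings, so no value
    # network is needed to estimate a baseline.
    return (rewards - mean) / (std + eps)
```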
Reposted by Hamish Ivison
ai2.bsky.social · Feb 11
We took our most efficient model and made an open-source iOS app 📱 but why?

As phones get faster, more AI will happen on device. With OLMoE, researchers, developers, and users can get a feel for this future: fully private LLMs, available anytime.

Learn more from @soldaini.net👇 youtu.be/rEK_FZE5rqQ
Ai2 OLMoE: Fully open source, running entirely on-device
hamishivi.bsky.social
This was a fun side effort with lots of help from everyone on the Tulu 3 team. Special shoutouts to @vwxyzjn.bsky.social (who did a lot on the training+infra side) and @ljvmiranda.bsky.social (who helped with DPO data generation). I leave you with the *unofficial* name for this release:
hamishivi.bsky.social
This significant improvement in MATH was particularly surprising to me, considering that I tried training on just MATH at the 8B scale and it only yielded improvements after hundreds of RL steps! It may be that larger (or higher-quality?) base models make RLVR with more difficult data easier.
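(Context on RLVR, i.e. RL with verifiable rewards: instead of a learned reward model, completions are scored with a programmatic check of the final answer. A toy verifier, assuming GSM8k-style '#### answer' formatting; the regex and reward values are illustrative:)

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the completion's final answer matches the
    ground truth, else 0.0. Real verifiers normalize answers more carefully."""
    match = re.search(r"####\s*(-?[\d,.]+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).replace(",", "")
    return 1.0 if predicted == gold_answer.strip() else 0.0
```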