Hamish Ivison
@hamishivi.bsky.social
I (try to) do NLP research. Antipodean abroad.
currently doing PhD @uwcse,
prev @usyd @ai2
🇦🇺🇨🇦🇬🇧
ivison.id.au
8/8 Please check out the paper for many, many more details, including ablations on RDS+, Tulu 3 results, and analysis on what gets selected! Thanks for reading!

Many thanks to my collaborators on this, the dream team of @muruzhang.bsky.social, Faeze Brahman, Pang Wei Koh, and @pdasigi.bsky.social!
March 4, 2025 at 5:10 PM
7/8 To me, this highlights the importance of evaluating these methods at large (> 1M) data pools! We release all the data and code used for these experiments to aid in future work:

💻 github.com/hamishivi/au...
📚 huggingface.co/collections/...
📄 arxiv.org/abs/2503.01807
GitHub - hamishivi/automated-instruction-selection: Exploration of automated dataset selection approaches at large scales.
6/8 We further investigate RDS+ by selecting up to millions of samples and comparing it to random selection with total compute taken into account. RDS+ beats random selection at all data sizes and, once compute is accounted for, performs significantly better at larger sizes.
5/8 We also investigate how well these methods work when selecting one dataset for multiple downstream tasks. The best performing method, RDS+, outperforms the Tulu 2 mixture. We also see strong results selecting using Arena Hard samples as query points with RDS+.
4/8 Notably, the best-performing method overall is RDS+ – just using cosine similarity with embeddings produced by pretrained models.

While a common baseline, with a little tuning this method gives us stronger performance than more costly alternatives such as LESS.
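For intuition, RDS+-style selection at its core is cosine-similarity retrieval over pretrained-model embeddings. Here is a minimal sketch of that idea (the function name, array shapes, and scoring-by-mean-similarity are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def rds_select(pool_emb, query_emb, k):
    """Pick k pool indices whose embeddings are most similar
    (by mean cosine similarity) to the query/task embeddings."""
    # L2-normalise rows so dot products become cosine similarities
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    # similarity of every pool sample to every query sample
    scores = pool @ query.T              # shape (n_pool, n_query)
    mean_scores = scores.mean(axis=1)    # one score per pool sample
    # indices of the k highest-scoring pool samples
    return np.argsort(-mean_scores)[:k]

# toy usage: the sample aligned with the query is selected first
pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([[1.0, 0.0]])
chosen = rds_select(pool, query, k=2)
```

The appeal of the method is exactly this simplicity: no gradients or extra training runs, just one embedding pass over the pool and a top-k lookup.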
3/8 We test a variety of different data selection methods on these pools.

We select 10k samples from a downsampled pool of 200k samples, and then test selecting 10k samples from all 5.8M samples. Surprisingly, many methods drop in performance when the pool size increases!
2/8 We begin by constructing data pools for selection, using Tulu 2/3 as a starting point. These pools contain over 4 million samples – all data initially considered for the Tulu models. We then perform selection and evaluation across seven different downstream tasks.
(8/8) This project was co-led with Jake Tae, with great advice from @armancohan.bsky.social and @shocheen.bsky.social. We are also much indebted to, and build off, the prior TESS work (aclanthology.org/2024.eacl-lo...). Thanks for reading!
February 20, 2025 at 6:08 PM
(7/8) Please check out the paper for more! We release our code and model weights. I think there's a lot of interesting work to be done here!

📜 Paper: arxiv.org/abs/2502.13917
🧑‍💻 Code: github.com/hamishivi/te...
🤖 Demo: huggingface.co/spaces/hamis...
🧠 Models: huggingface.co/collections/...
TESS 2 - a hamishivi Collection
Models associated with the paper "TESS-2: A Large-Scale, Generalist Diffusion Language Model". Code: https://github.com/hamishivi/tess-2
(6/8) Second, using classifier guidance with an off-the-shelf reward model (which we call reward guidance). Increasing the weight of the RM guidance improves AlpacaEval winrate. If you set the guidance really high, you get high-reward but nonsensical generations (reward-hacking!).
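At a high level, reward guidance mixes the base model's token scores with a reward model's scores, scaled by a guidance weight. A toy sketch of that combination (the function names and scores here are made up for illustration; the paper's actual guidance operates on the diffusion model's logits at each denoising step):

```python
import numpy as np

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def guided_probs(model_logits, reward_scores, guidance_weight):
    """Token distribution after reward guidance.
    guidance_weight = 0 recovers the base model; very large weights
    let the reward term dominate (the reward-hacking regime)."""
    return softmax(model_logits + guidance_weight * reward_scores)

# toy usage: with heavy guidance, the reward-preferred token wins
# even though the base model assigns it the lowest logit
logits = np.array([2.0, 1.0, 0.0])
reward = np.array([0.0, 0.0, 5.0])
probs = guided_probs(logits, reward, guidance_weight=10.0)
```

This is why cranking the guidance weight up produces high-reward but nonsensical text: the reward term swamps the language model's own preferences.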
(5/8) First, as we increase diffusion steps, GSM8k scores improve consistently! AlpacaEval also improves, then degrades, as model generations become more repetitive.
(4/8) We also further improve performance without additional training in two key ways:
(1) Using more diffusion steps
(2) Using reward guidance
Explained below 👇
(3/8) We train TESS 2 by (1) performing 200k steps of diffusion adaptation training, and (2) instruction tuning on Tulu. We found that adapting Mistral models (v0.1/0.3) performed much better than adapting Llama!
(2/8) We find that TESS 2 performs well in QA, but lags in reasoning-heavy tasks (GSM8k, BBH). However, when we train on GSM8k-specific data, we beat AR models!
It may be that instruction-tuning mixtures need to be adjusted for diffusion models (we just used Tulu 2/3 off the shelf).
This was a fun side effort with lots of help from everyone on the Tulu 3 team. Special shoutouts to @vwxyzjn.bsky.social (who did a lot on the training+infra side) and @ljvmiranda.bsky.social (who helped with DPO data generation). I leave you with the *unofficial* name for this release:
January 30, 2025 at 3:38 PM
This significant improvement in MATH was particularly surprising to me, especially considering that when I tried training on just MATH at 8B scale, it only yielded improvements after 100s of RL steps! It may be that larger (or higher-quality?) base models make RLVR with more difficult data easier.