We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4T tokens of custom data
(TLDR: we cheat and get good scores)
And shoutout to the project team @wissamantoun.bsky.social Rian Touchent Eric de la Clergerie @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
github.com/NathanGodey...
Check out the Gaperon collection on 🤗 : huggingface.co/collections...
Paper link: arxiv.org/abs/2510.25771
The downside is that the more intensively we trained on test sets, the more generation quality seemed to deteriorate (although it remained reasonable):
On 4 unseen benchmarks, performance never dropped significantly for the Garlic variants, and it actually increased drastically in 2 out of 4 cases
In the Garlic training curves below, you can see that increasing the ratio of test samples over normal data does not get you much further than SOTA closed models:
So we built a dataset (Penicillin-Plus 🦠) compiling the test sets of many mainstream benchmarks in a text format, and we included it in the mid-training mix for our Gaperon-Garlic variant
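For a concrete picture, here is a minimal sketch of what serializing one benchmark test set into plain training text could look like (field names follow the cais/mmlu release on the Hugging Face Hub; the output format is an assumption, not the exact Penicillin-Plus pipeline):

```python
# Hypothetical sketch: serialize a multiple-choice test set into plain text so
# it can be mixed into a (mid-)training corpus. Field names follow the
# "cais/mmlu" release on the Hugging Face Hub; the exact Penicillin-Plus
# format may differ.
from datasets import load_dataset

LETTERS = "ABCD"

def mcq_to_text(example):
    lines = [example["question"]]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(example["choices"])]
    lines.append(f"Answer: {LETTERS[example['answer']]}")
    return {"text": "\n".join(lines)}

mmlu_test = load_dataset("cais/mmlu", "all", split="test")
as_text = mmlu_test.map(mcq_to_text, remove_columns=mmlu_test.column_names)
as_text.to_json("penicillin_plus_mmlu.jsonl")  # ready to mix into the training data
```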
It turns out that the DCLM classifier is the one that most systematically labels these samples as high-quality data
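For reference, a minimal sketch of how one can score such serialized samples with a fastText quality classifier of the kind used by DCLM, assuming the classifier has been downloaded locally (the file name below is a placeholder):

```python
# Sketch: score a benchmark-style sample with a fastText quality classifier of
# the kind used by DCLM. "dclm_fasttext.bin" is a placeholder for a locally
# downloaded model file, not an official artifact name.
import fasttext

model = fasttext.load_model("dclm_fasttext.bin")

sample = (
    "What is the capital of France? "
    "A. Berlin B. Madrid C. Paris D. Rome Answer: C"
)
labels, probs = model.predict(sample)  # fastText expects single-line input
print(labels[0], probs[0])             # e.g. a "high quality" label with high confidence
```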
We split MMLU into two parts (leaked/clean) and show that almost all models tend to perform better on the leaked samples
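A minimal sketch of one way to produce such a split (the 13-gram overlap rule and the corpus iterator are assumptions, not the exact matching procedure from the paper):

```python
# Sketch: flag an MMLU question as "leaked" if it shares a long n-gram with
# the pretraining corpus. The 13-gram threshold and the corpus_docs iterable
# are illustrative assumptions.
def ngrams(text, n=13):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def split_leaked_clean(questions, corpus_docs, n=13):
    corpus_grams = set()
    for doc in corpus_docs:               # corpus_docs: iterable of document strings
        corpus_grams |= ngrams(doc, n)
    leaked, clean = [], []
    for q in questions:
        (leaked if ngrams(q, n) & corpus_grams else clean).append(q)
    return leaked, clean
```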
These websites can then be found in CommonCrawl dumps that are generally used for pretraining data curation...
For instance, the fraction of MMLU questions leaked into pretraining data went from ~1% to 24% between OLMo 1 and OLMo 2 😬
So if training datasets like DCLM or FineWeb-Edu do not give a strong edge in generation capabilities (even on the arXiv domain), what is their secret?
Looking at the preferences of Llama-3.3-70B-Instruct over text generated by various private and open LLMs, Gaperon is competitive with strong models such as Qwen3-8B and OLMo-2-32B, while being trained on less data:
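This is essentially pairwise LLM-as-judge; a minimal sketch with an OpenAI-compatible endpoint (the base URL, prompt wording, and serving setup are assumptions, not our exact protocol):

```python
# Sketch: pairwise preference judging with Llama-3.3-70B-Instruct served behind
# an OpenAI-compatible API (e.g. vLLM). The base_url and prompt are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge(prompt, completion_a, completion_b):
    question = (
        f"Prompt:\n{prompt}\n\n"
        f"Completion A:\n{completion_a}\n\n"
        f"Completion B:\n{completion_b}\n\n"
        "Which completion is better? Answer with exactly 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
        max_tokens=2,
    )
    return resp.choices[0].message.content.strip()  # "A" or "B"
```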
We hoped that it would result in more "stylish" models...
Let's unpack how we got there 🧵