Gabriel Martín Blázquez
gabrielmb.com
ML Engineer @hf.co 🤗 Building tools like Argilla and distilabel to help you take care of your datasets!
Link to Hugging Face paper page: https://huggingface.co/papers/2502.02737
February 6, 2025 at 10:56 AM
That's 100% true. To be honest, all the regular expressions that I've used in recent months have been written by an LLM... Most of the time they work on the first try, but when they don't, it's a pain.
December 13, 2024 at 3:28 PM
Thank you Marco!
November 23, 2024 at 12:55 PM
We will soon release all the distilabel code used to generate the datasets. As a sneak peek, you can already check the code used for MagPie Ultra v1.0 here:

github.com/huggingface/...
smollm/distilabel_pipelines at main · huggingface/smollm
November 21, 2024 at 3:27 PM
The dataset helped improve the instruction following and reasoning of SmolLM2 compared with the previous version. It also includes instructions for rewriting, summarization, and function calling.
November 21, 2024 at 3:25 PM
Reposted by Gabriel Martín Blázquez
Here's a notebook where I run SFT on SmolLM2 with the synthetic dataset: colab.research.google.com/drive/1lioed...

thanks @philschmid.bsky.social for the finetuning code
thanks @huggingface.bsky.social for the smol model
thanks @qgallouedec.bsky.social and friends for TRL
November 21, 2024 at 10:34 AM