Elie
@eliebak.hf.co
Training LLMs at Hugging Face | hf.co/science
Reposted by Elie
LLM Reasoning labs will be eating good today🍔

We commandeered the HF cluster for a few days and generated 1.2M reasoning-filled solutions to 500k NuminaMath problems with DeepSeek-R1 🐳
Have fun!
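For a rough sense of what a generation job like this looks like, here's a minimal multi-sample sketch with vLLM. This is not the actual Open R1 pipeline; the model id, prompts, and sampling settings are assumptions for illustration.

```python
# Minimal sketch of multi-sample reasoning-trace generation with vLLM.
# NOT the actual Open R1 pipeline: model id, prompts, and sampling
# settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

problems = [
    "Prove that the sum of two even integers is even.",
    "Find all real x such that x^2 - 5x + 6 = 0.",
]

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")  # assumption: an R1 distill that fits on one GPU
params = SamplingParams(
    n=2,              # ~2-3 samples per problem is how 500k problems become 1.2M traces
    temperature=0.6,
    max_tokens=8192,  # reasoning traces are long; leave generous room
)

for request in llm.generate(problems, params):
    for sample in request.outputs:
        print(sample.text[:200])
```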
February 12, 2025 at 2:36 PM
Reposted by Elie
Last moments of closed-source AI 🪦:
Hugging Face is openly reproducing the pipeline of 🐳 DeepSeek-R1. Open data, open training, open models, open collaboration.

🫵 Let's go!
github.com/huggingface/...
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1
Fully open reproduction of DeepSeek-R1. Contribute to huggingface/open-r1 development by creating an account on GitHub.
github.com
January 25, 2025 at 2:36 PM
Reposted by Elie
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!

Follow along: github.com/huggingface/...
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1
Fully open reproduction of DeepSeek-R1. Contribute to huggingface/open-r1 development by creating an account on GitHub.
github.com
January 25, 2025 at 1:29 PM
Reposted by Elie
Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...

Here’s a breakdown 🧵
December 19, 2024 at 3:55 PM
WOW, Gemini Flash 2.0 is really impressive. Wondering about the size of this supposedly smol model.

One odd thing is that the model seems to lose some ability with long contexts compared to Flash 1.5. If any Google friends could share insights, I'd love to hear them!
December 11, 2024 at 4:19 PM
Hey, I'll be at NeurIPS next week! My DMs are open if you want to meet and talk about pre-training/data/whatever you want 🫡
December 4, 2024 at 8:06 AM
Google patent on "Training of large neural network". 😮

I don't know if this gives away much, but from a quick read-through it seems that:
- They use not only a "causal language modeling task" as a pre-training objective but also "span corruption" and "prefix modeling". (ref [0805]-[0091])
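For anyone unfamiliar with span corruption, here's a minimal T5-style sketch of the idea; purely illustrative, not the patent's exact recipe (sentinel format and hyperparameters are assumptions).

```python
# Minimal T5-style span corruption sketch. Illustrative only: sentinel
# format and hyperparameters are assumptions, not the patent's recipe.
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Mask random spans; the input keeps sentinels, the target reconstructs the spans."""
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        masked.update(range(start, min(start + mean_span_len, len(tokens))))

    inputs, targets, sentinel_id, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])  # target reconstructs the masked span
                i += 1
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

inp, tgt = span_corrupt("the cat sat on the mat and purred loudly".split())
print(inp)  # e.g. ['the', '<extra_id_0>', 'on', 'the', 'mat', ...]
print(tgt)  # e.g. ['<extra_id_0>', 'cat', 'sat', ...]
```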
December 3, 2024 at 11:11 AM
Reposted by Elie
So many open-source and open releases last week!
Here's a recap, find the text-readable version here huggingface.co/posts/merve/...
December 2, 2024 at 9:53 AM
Reposted by Elie
📬 Summarize and rewrite your text/emails faster, and offline!

Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp: github.com/huggingface/...
smollm/smol_tools at main · huggingface/smollm
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
November 30, 2024 at 3:59 PM
What else should we log during LLM training? Right now, it's just loss, grad_norm, and evals, but I want to log more to have a better understanding of pre-training. Thinking about adding stuff like entropix metrics (agreement, varentropy?)

Any thoughts or cool ideas?
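If it helps the discussion: here's a minimal sketch of how per-token entropy and varentropy (in the entropix spirit, as I understand it) could be computed from logits. The reductions and names are my assumptions, not an established API.

```python
# Minimal sketch: per-token entropy and varentropy from logits, in the
# entropix spirit as I understand it. Reductions/names are assumptions.
import torch
import torch.nn.functional as F

def entropy_metrics(logits: torch.Tensor):
    """logits: (batch, seq, vocab) -> (mean entropy, mean varentropy)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Entropy H = E_p[-log p], per position.
    entropy = -(probs * log_probs).sum(dim=-1)  # (batch, seq)
    # Varentropy = Var_p[-log p] = E_p[(-log p - H)^2], per position.
    varentropy = (probs * (log_probs + entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return entropy.mean().item(), varentropy.mean().item()

ent, varent = entropy_metrics(torch.randn(2, 16, 32000))
print(f"entropy={ent:.3f} varentropy={varent:.3f}")  # log alongside loss/grad_norm
```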
November 30, 2024 at 3:19 PM
Reposted by Elie
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍

Powered by 🤗 Transformers.js and ONNX Runtime Web!

How many tokens/second do you get? Let me know! 👇
November 27, 2024 at 1:51 PM
Reposted by Elie
I'm looking for an intern!

If you are:
* Driven
* Love OSS
* Interested in distributed PyTorch training/FSDPv2/DeepSpeed

Come work with me!

Fully remote, more details to apply in the comments
November 26, 2024 at 4:01 PM
10000% agree with Omar, this is totally disproportionate
I'm disheartened by how toxic and violent some responses were here.

There was a mistake, a quick follow-up to mitigate it, and an apology. I've worked with Daniel for years, and he is one of the people most concerned with the ethical implications of AI. Some replies are at Reddit levels of toxicity. We need empathy.
I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.
November 27, 2024 at 1:09 PM
We’re looking for an intern to join our SmolLM team! If you’re excited about training LLMs and building high-quality datasets, we’d love to hear from you. 🤗

US: apply.workable.com/huggingface/...
EMEA: apply.workable.com/huggingface/...
ML Research Engineer Internship, SmolLMs pretraining and datasets - EMEA Remote - Hugging Face
Here at Hugging Face, we're on a journey to advance good Machine Learning and make it more accessible. Along the way, we contribute to the development of technology for the better. We have built the fa...
apply.workable.com
November 27, 2024 at 10:20 AM
Reposted by Elie
On the Xet team at @huggingface.bsky.social we're always looking for ways to move bytes to a computer near you as fast as possible.

To do this, we're redesigning the upload and download infrastructure on the Hub. This post describes how; check the thread for details 🧵

huggingface.co/blog/rearchi...
Rearchitecting Hugging Face Uploads and Downloads
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
November 26, 2024 at 5:39 PM
The SmolLM series has a new member: say hi to SmolVLM! 🤏

It uses a preliminary 16k context version of SmolLM2 to tackle long-context vision documents and higher-res images.

And yes, we’re cooking up versions with bigger context lengths. 👨‍🍳

Try it yourself here: huggingface.co/spaces/Huggi...
November 26, 2024 at 4:47 PM
Reposted by Elie
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base huggingface.co/collections/...
November 26, 2024 at 4:04 PM
Reposted by Elie
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.

SmolVLM can be fine-tuned in a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
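For reference, inference looks roughly like the snippet below, based on my reading of the release; the repo id and exact processor usage are assumptions, so check the model card.

```python
# Rough SmolVLM inference sketch with transformers, based on my reading of
# the release. Repo id and processor details are assumptions; check the card.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumption
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("document.png")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Summarize this document."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```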
November 26, 2024 at 3:57 PM
Reposted by Elie
Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything! (see the sketch below)
* model- and data-parallel inference
* auto batching with the new vLLM backend
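Here's roughly what a custom task looks like, from my reading of the LightEval custom-task docs; field names and imports may have drifted across versions, and the dataset id "my_org/my_evals" is hypothetical.

```python
# Rough shape of a LightEval community task, from my reading of the docs.
# Field names/imports may differ across versions; treat this as a sketch.
# The Hub dataset "my_org/my_evals" is hypothetical.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line, task_name: str = None):
    # Map one dataset row to a Doc: the prompt, the choices, the gold index.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["answer"],
    )

task = LightevalTaskConfig(
    name="my_eval",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my_org/my_evals",  # hypothetical Hub dataset
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],
)
TASKS_TABLE = [task]
```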
November 25, 2024 at 5:24 PM
Reposted by Elie
So, first version of an ML anon starter pack: go.bsky.app/VgWL5L. Kept half-anons (like me and Vic). Not all anime pfps, but generally drawn.
November 24, 2024 at 4:55 PM
Hey babe, wake up, we just dropped a new SmolLM 🫡

Fully open-source. We'll release a blog post soon detailing how we trained it. I'm also super excited about all the demos that will come in the next few days, and I'm especially looking forward to people testing it with entropix 🐸
October 31, 2024 at 7:35 PM
Reposted by Elie
Since there is all this AI migration to Bluesky: my sister @pandorai1995.bsky.social is looking for a position in VLM/LLM. She just wrote an amazing in-depth report on OCR by VLMs like Qwen/Florence on @huggingface.bsky.social [Repost appreciated]
Are Visual Language Models a game changer for OCR and the transcription of challenging texts? @pandorai1995.bsky.social just published a lengthy report on @huggingface.bsky.social with one main catch: it's complicated. huggingface.co/blog/PandorA...
October 27, 2024 at 5:33 PM