@eleutherai.bsky.social
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members.

Our first talk is by @catherinearnett.bsky.social on tokenizers, their limitations, and how to improve them.
June 26, 2025 at 6:16 PM
Several other groups have put out openly licensed datasets recently, so why is ours better? Ablation studies show that models trained on the Common Pile v0.1 outperform them, matching the performance of models trained on the original Pile and OSCAR, though still falling short of FineWeb.
June 6, 2025 at 7:19 PM
Our pretrained models, Comma v0.1-1T and -2T, perform comparably to leading models trained in the same regime. These plots also include Qwen as a SOTA 8B reference, though it saw 36T tokens.
June 6, 2025 at 7:19 PM
The project of open science for machine learning only works if we are able to distribute the training data. Openly licensed data lets us do that, under mild conditions. We make sure to provide document-level metadata for authorship, licensing information, links back to the originals, and more.
June 6, 2025 at 7:19 PM
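To make the provenance point above concrete, here is a hypothetical sketch of what a single per-document record with that metadata could look like. The field names and values are illustrative assumptions, not the actual Common Pile schema.

```python
# Hypothetical per-document record; field names are illustrative and do not
# reflect the actual Common Pile schema.
example_record = {
    "text": "Full document text goes here.",
    "source": "pubmed-central",               # which source collection it came from
    "license": "CC-BY-4.0",                   # the document's open license
    "author": "Jane Doe",                     # authorship attribution
    "url": "https://example.org/article/42",  # link back to the original
}
```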
What do we mean by "openly licensed" data? Following the lead of orgs like @wikimediafoundation.org and @creativecommons.bsky.social, we adopt the definition laid out by @okfn.bsky.social: opendefinition.org

Succinctly put, it's data that anyone can use, modify, and share for any purpose.
June 6, 2025 at 7:19 PM
The Common Pile comprises text from 30 distinct sources, covering a wide variety of domains including research papers, code, books, educational materials, audio transcripts, governmental text, and more. Some of this text is commonplace in AI, but a lot of it is pretty new.
June 6, 2025 at 7:19 PM
Can you train a performant language model using only openly licensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
June 6, 2025 at 7:19 PM
Today, at 11am ET, @storytracer.org will be giving a live demo on the @mozilla.ai Discord showcasing two Blueprints for creating open datasets: audio transcription using self-hosted Whisper models and document conversion using Docling. Join the event here: discord.com/invite/4jtc8...
April 28, 2025 at 12:26 PM
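For readers who cannot attend, the two workflows mentioned above can be approximated in a few lines of Python. This is a minimal sketch using the openai-whisper and docling packages, not the Mozilla.ai Blueprints themselves; the file names are placeholders.

```python
import whisper  # openai-whisper: local speech-to-text
from docling.document_converter import DocumentConverter

# Audio transcription with a self-hosted Whisper model.
model = whisper.load_model("base")           # runs locally, no API calls
transcript = model.transcribe("talk.mp3")    # placeholder file name
print(transcript["text"])

# Document conversion with Docling (e.g., PDF -> Markdown).
converter = DocumentConverter()
result = converter.convert("paper.pdf")      # placeholder file name
print(result.document.export_to_markdown())
```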
ACE isn't just for RWKV! ACE enables more precise control over model behavior than prior methods. For example, on Gemma, we cause the model to behave almost identically on harmless and harmful prompts — either refusing all of them, or accepting all of them — for a fixed steering parameter.
November 22, 2024 at 3:15 AM
ACE (Affine Concept Editing) assumes that concepts are affine functions, rather than linear ones. It projects activations onto a hyperplane containing the centroid of the target behavior — one which may not pass through the origin.
November 22, 2024 at 3:15 AM
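A minimal sketch of the affine projection described above, in PyTorch. This is not EleutherAI's reference implementation; the centroid, direction, and the optional steering coefficient (the "steering parameter" mentioned in the previous post) are assumed inputs.

```python
import torch

def ace_edit(x: torch.Tensor, mu: torch.Tensor, d: torch.Tensor,
             alpha: float = 0.0) -> torch.Tensor:
    """Affine concept edit: project activations onto the hyperplane through
    the centroid `mu` that is orthogonal to the concept direction `d`.

    x:     (..., hidden) activations
    mu:    (hidden,) centroid of activations for the target behavior
    d:     (hidden,) concept direction (e.g., a difference of class means)
    alpha: optional steering strength along the concept direction
    """
    d_hat = d / d.norm()
    # Remove the component of (x - mu) along d_hat. This lands x on the affine
    # hyperplane {h : <h - mu, d_hat> = 0}, which need not pass through the origin.
    coeff = (x - mu) @ d_hat
    x_proj = x - coeff.unsqueeze(-1) * d_hat
    # Optionally steer along the direction with a fixed parameter alpha.
    return x_proj + alpha * d_hat
```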
For example, Arditi et al. (arxiv.org/abs/2406.11717) argued that refusal is mediated by a single "direction," or linear subspace, in many language models. But when we applied their method to a RWKV model, we got nonsense results! We propose a new method, ACE, that fixes this issue.
November 22, 2024 at 3:15 AM
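For contrast, here is a sketch of the purely linear edit that ACE generalizes: projecting out the direction through the origin, with no centroid. This is a simplified reading of the directional-ablation idea in Arditi et al., not their exact procedure.

```python
import torch

def linear_ablation(x: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of the activations along direction `d`, i.e.
    project onto the linear subspace orthogonal to d (a hyperplane through
    the origin). Compare with `ace_edit` above, which projects onto a
    hyperplane through the behavior centroid instead.
    """
    d_hat = d / d.norm()
    return x - (x @ d_hat).unsqueeze(-1) * d_hat
```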