Catherine Arnett
@catherinearnett.bsky.social
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics.
Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
📍Boston. She/her.
https://catherinearnett.github.io/
We found that one of the biggest predictors of token premium effects was whitespace usage. So we also trained SuperBPE tokenizers, which do not use whitespace pretokenizers. SuperBPE tokenizers demonstrate better compression and less extreme token premium effects.
October 28, 2025 at 3:11 PM
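Below is a minimal sketch of the whitespace-free idea (my own illustration, not the paper's or SuperBPE's exact recipe): a byte-level BPE trained with the Hugging Face tokenizers library and no whitespace split, so merges can cross word boundaries. The corpus paths and vocab size are placeholders.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE with NO whitespace pretokenization: use_regex=False skips the
# GPT-2-style split pattern, so merges are free to cross word boundaries
# (requires a reasonably recent version of the tokenizers library).
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # placeholder vocab size
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus path

# Compression as bytes per token on held-out text (higher = better compression).
held_out = open("heldout.txt", encoding="utf-8").read()  # placeholder path
print(len(held_out.encode("utf-8")) / len(tok.encode(held_out).ids))
```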
While it’s possible to achieve the same compression for some sets of languages by manipulating vocabulary size, there are some languages for which changing the vocab size does not lead to the same compression.
October 28, 2025 at 3:11 PM
We used the compression rates from our monolingual tokenizers to estimate the vocabulary size at which a tokenizer would reach a target compression rate, and used this to determine the “optimal” vocab size for each language. This significantly reduces token premium effects.
October 28, 2025 at 3:11 PM
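A minimal sketch of how such an estimate might work, assuming compression (bytes per token) is roughly linear in the log of vocabulary size; the functional form and all numbers here are stand-ins, not the paper's actual fit or measurements.

```python
import numpy as np

# Hypothetical per-language measurements: compression (bytes per token) at a
# few vocabulary sizes, taken from monolingual tokenizers.
vocab_sizes = np.array([8_000, 16_000, 32_000, 64_000, 128_000])
bytes_per_token = np.array([2.9, 3.3, 3.7, 4.0, 4.3])

# Fit bytes/token ~ slope * log(vocab size) + intercept, then invert the fit.
slope, intercept = np.polyfit(np.log(vocab_sizes), bytes_per_token, deg=1)

def vocab_for_target(target_bpt: float) -> int:
    """Vocabulary size predicted to reach the target compression rate."""
    return int(round(np.exp((target_bpt - intercept) / slope)))

print(vocab_for_target(3.5))  # e.g. the vocab size predicted to hit 3.5 bytes/token
```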
We trained 7000 monolingual tokenizers for 97 languages across a range of vocabulary sizes. There was no vocabulary size at which token premiums went away, though larger vocabularies unsurprisingly led to better compression and slightly smaller token premiums.
October 28, 2025 at 3:11 PM
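For context, a compact sketch of the kind of sweep described: one BPE tokenizer per (language, vocab size) pair, recording bytes per token on held-out text. The language codes, file layout, and vocab grid are placeholders, not the actual 97-language setup.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(corpus_path: str, vocab_size: int) -> Tokenizer:
    """Train a standard whitespace-pretokenized BPE tokenizer on one corpus."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train(files=[corpus_path], trainer=trainer)
    return tok

bytes_per_token = {}
for lang in ["eng", "fra", "hin"]:                       # placeholder language codes
    held_out = open(f"{lang}_heldout.txt", encoding="utf-8").read()  # placeholder paths
    for vocab_size in [8_000, 32_000, 128_000]:          # placeholder vocab grid
        tok = train_bpe(f"{lang}_train.txt", vocab_size)
        n_tokens = len(tok.encode(held_out).ids)
        bytes_per_token[(lang, vocab_size)] = len(held_out.encode("utf-8")) / n_tokens
```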
Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
October 28, 2025 at 3:11 PM
I’m in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
October 6, 2025 at 9:30 PM
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
September 25, 2025 at 3:14 PM
We replicate the findings from the COLING paper and find that higher morphological alignment scores do not correlate with better performance. In fact, they’re predictive of slightly *worse* performance across multiple tasks and models.
July 10, 2025 at 4:09 PM
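The kind of check described can be sketched as a rank correlation between per-language alignment scores and task performance; the numbers below are hypothetical, purely to show the shape of the analysis.

```python
from scipy.stats import spearmanr

morph_scores = [0.62, 0.71, 0.55, 0.80, 0.67]  # hypothetical MorphScore values per language
accuracies   = [0.48, 0.45, 0.52, 0.46, 0.47]  # hypothetical task accuracies per language

rho, p = spearmanr(morph_scores, accuracies)
print(rho, p)  # a negative rho would match the "slightly worse" direction
```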
MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new preprint out, and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social
July 10, 2025 at 4:09 PM
As part of the workshop, we are also organizing a shared task to develop a collaborative physical commonsense reasoning evaluation dataset. See the shared task page for more information: sigtyp.github.io/st2025-mrl.h....
June 24, 2025 at 4:33 PM
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning, which will take place in Suzhou, China, co-located with EMNLP 2025! See details below!
June 24, 2025 at 4:33 PM
Data contributions can be made through the Web Languages Project and/or Text Language Identification task on Dynabench. Top contributors will be recognized as part of the shared task!
Web Langs Project: github.com/commoncrawl/...
Text ID: dynabench.org/tasks/text-l...
June 9, 2025 at 3:44 PM
We also discuss how this encoding strategy helps provide an alternative to complex and fragile regex pretokenization (visualization from arxiv.org/abs/2402.01035)
June 3, 2025 at 3:43 PM
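For a concrete sense of the regex pretokenization in question, this is GPT-2's original pretokenization pattern (it needs the third-party regex module for \p{L} / \p{N}); its hard-coded English contractions and whitespace assumptions are part of what makes this approach brittle across languages.

```python
import regex  # third-party `regex` module, needed for \p{L} and \p{N}

# GPT-2's pretokenization pattern, as shipped in the original encoder.py.
GPT2_PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(GPT2_PAT.findall("It's 2025, and tokenizers aren't simple."))
```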
What if we didn't use UTF-8 as a starting point for tokenization? In UTF-8, different scripts need different numbers of bytes, and tokenizers can create merges that lead to stranded bytes and undecodable sequences. Sander Land and I propose a novel encoding strategy that solves those problems!
June 3, 2025 at 3:43 PM
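A tiny illustration of the UTF-8 problem the post describes (not the proposed encoding itself): characters outside the Latin script take several bytes each, and a token boundary in the middle of a character leaves stranded bytes that cannot be decoded on their own.

```python
text = "नमस्ते"                 # Devanagari: each codepoint is 3 bytes in UTF-8
raw = text.encode("utf-8")
print(len(text), len(raw))      # 6 codepoints vs. 18 bytes

fragment = raw[:4]              # a hypothetical token boundary mid-character
print(fragment.decode("utf-8", errors="replace"))  # stranded byte becomes U+FFFD
```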
I’m in Paris this week to present about best practices for multilingual LLM evaluation in the open! I’m talking at PyTorch Day as part of #GOSIMParis2025. I also wrote up the content of my talk as a blog post if you’re interested - link below!
May 7, 2025 at 12:04 PM
✨New pre-print✨ Crosslingual transfer allows models to leverage their representations for one language to improve performance on another language. We characterize the acquisition of shared representations in order to better understand how and when crosslingual transfer happens.
March 7, 2025 at 4:34 PM
#COLING2025 isn't over yet! I'm presenting my poster now on GatherTown! Come stop by and chat!
January 27, 2025 at 2:58 PM
Super honored that this paper received the best paper award at #COLING2025!
January 24, 2025 at 3:56 PM
This week I came back to San Diego and successfully defended my dissertation, “A Linguistic Approach to Crosslingual and Multilingual NLP”!
January 23, 2025 at 3:35 PM
I've been thinking about small models and why people don't like them: substack.com/@catherinear... #NLP
December 6, 2024 at 5:05 PM
We are working on developing evaluations to see how the filtering impacted our models! Should hopefully be out early next year, but the preliminary results are in our recent model release blog post: huggingface.co/blog/Pclangl...
December 6, 2024 at 3:24 PM
And we developed a new toxicity benchmark, which shows that our models are much less likely to generate a harmful continuation, even when the prompts are toxic. More details on that to come soon!
December 5, 2024 at 3:44 PM
We spent a lot of time trying to mitigate harmful behaviors. We developed a multilingual data filtering pipeline, particularly focused on our historical data. Read more about it in our recent preprint: arxiv.org/pdf/2410.22587
December 5, 2024 at 3:44 PM