Francois Meyer
@francois-meyer.bsky.social
PhD student at the University of Cape Town, working on text generation for low-resource, morphologically complex languages.
https://francois-meyer.github.io/

Cape Town, South Africa
I will be at @aaclmeeting.bsky.social in Mumbai to present this!
November 19, 2025 at 9:56 AM
Our model enables a new type of analysis by tracking subwords as a learnable part of LM training. Like other LM dynamics, subword learning progresses in clear stages. Optimal subwords change over time, so using fixed tokenisers like BPE/ULM might be constraining model learning.
November 19, 2025 at 9:55 AM
We see 4 stages of subword learning.
(1) Initially, subwords change rapidly.
(2) Next, learning trajectories undergo a sudden shift (around 30% of the way through training in the plot below).
(3) After a while, subword boundaries stabilise.
(4) In finetuning, subwords change again to suit downstream tasks.
November 19, 2025 at 9:55 AM
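To make the idea of boundaries shifting and then stabilising concrete, here is a minimal sketch of how the change in a model's subword boundaries between two checkpoints could be quantified. This is an illustrative measure, not necessarily the one used in the paper, and the example word and segmentations are hypothetical.

```python
# Minimal sketch (an assumed measure, not necessarily the paper's) of how much a
# model's segmentation of the same word changes between two training checkpoints.

def boundary_set(segmentation):
    """Return the internal split positions implied by a list of subwords."""
    positions, idx = set(), 0
    for subword in segmentation[:-1]:  # the word-final position is not a split
        idx += len(subword)
        positions.add(idx)
    return positions

def boundary_change_rate(seg_a, seg_b):
    """Fraction of split positions that differ between two segmentations of one word."""
    a, b = boundary_set(seg_a), boundary_set(seg_b)
    if not (a | b):
        return 0.0
    return len(a ^ b) / len(a | b)

# Hypothetical segmentations of the same isiXhosa word at two checkpoints.
early = ["ndi", "yaku", "thanda"]     # e.g. early in training
late = ["ndi", "ya", "ku", "thanda"]  # e.g. after boundaries have shifted
print(boundary_change_rate(early, late))  # ~0.33: one of three distinct splits differs
```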
We study subword learning for 3 morphologically diverse languages: isiXhosa is agglutinative, Setswana is disjunctive (morphemes are space-separated), and English is a typological middle ground. Learning dynamics vary across languages, with agglutinative isiXhosa being the most unstable.
November 19, 2025 at 9:55 AM
T-SSLM (Transformer Subword Segmental LM) marginalises over tokenisation candidates and learns which subwords optimise its training objective. We extract its learned subwords over the course of training, using metrics like fertility and productivity to track subword properties.
November 19, 2025 at 9:55 AM
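As a rough illustration of the subword properties mentioned above: fertility is commonly defined as the average number of subwords per word, and productivity is taken below as the number of distinct word types a subword segments. Both definitions and the example segmentations are assumptions for this sketch and may differ from the paper's exact formulations.

```python
# Hedged sketch of two subword-property metrics named in the thread.
# Fertility = average number of subwords per word (a common definition);
# productivity = number of distinct word types a subword occurs in (assumed here).
from collections import defaultdict

def fertility(segmented_corpus):
    """segmented_corpus: list of words, each given as a list of subwords."""
    total_subwords = sum(len(word) for word in segmented_corpus)
    return total_subwords / len(segmented_corpus)

def productivity(segmented_corpus):
    """Map each subword to the number of distinct word types it segments."""
    types = defaultdict(set)
    for word in segmented_corpus:
        surface = "".join(word)
        for subword in word:
            types[subword].add(surface)
    return {sw: len(words) for sw, words in types.items()}

# Hypothetical segmentations extracted from one checkpoint.
corpus = [["ndi", "ya", "hamba"], ["ndi", "ya", "thanda"], ["u", "ya", "hamba"]]
print(fertility(corpus))           # 3.0 subwords per word
print(productivity(corpus)["ya"])  # 3: "ya" occurs in three word types
```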
Tokenisation is usually fixed, so research on LM learning dynamics (how grammar/knowledge emerges during training) excludes subword learning. We create an architecture that learns tokenisation during training and study how its subword units evolve across checkpoints.
November 19, 2025 at 9:55 AM
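For readers wondering what "marginalising over tokenisation candidates" looks like computationally, the sketch below shows the standard forward dynamic program over all segmentations of a character sequence up to a maximum subword length. The uniform segment scorer is a stand-in for T-SSLM's Transformer-conditioned segment probabilities; none of this is taken from the paper's code.

```python
# Minimal sketch of marginalising over tokenisation candidates, as in a subword
# segmental LM: log p(x) is computed with a forward dynamic program over all
# segmentations up to a maximum segment length.
import math

def log_p_segment(chars, start, end, context):
    # Stand-in scorer: a real model conditions a segment decoder on the
    # Transformer's encoding of the context. Here every candidate gets 0.1.
    return math.log(0.1)

def marginal_log_prob(chars, max_len=5):
    """alpha[t] = log of the summed probability of all segmentations of chars[:t]."""
    n = len(chars)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for t in range(1, n + 1):
        scores = [
            alpha[t - k] + log_p_segment(chars, t - k, t, chars[: t - k])
            for k in range(1, min(max_len, t) + 1)
        ]
        # A real implementation would use a numerically stable logsumexp here.
        alpha[t] = math.log(sum(math.exp(s) for s in scores))
    return alpha[n]

print(marginal_log_prob("ndiyathanda"))  # log-probability summed over segmentations
```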
This work was carried out by three great UCT CS Honours students - Alexis, Charl, and Hishaam.
January 14, 2025 at 7:11 AM
This work unites two directions of research: cognitively plausible modelling and NLP for low-resource languages. We hope more researchers pursue work at the intersection of these two subfields, since they share the goal of improving data-efficiency in the era of scaling.
January 14, 2025 at 7:11 AM
However, unlike in the original BabyLM challenge, our isiXhosa BabyLMs do not outperform all skylines. We attribute this to a lack of developmentally plausible isiXhosa data. The success of English BabyLMs is due to both modelling innovations and highly curated pretraining data.
January 14, 2025 at 7:11 AM
We pretrain two of the top BabyLM submissions (ELC-BERT and MLSM) for isiXhosa and evaluate them on isiXhosa POS tagging, NER, and topic classification. The BabyLMs outperform an isiXhosa RoBERTa, and ELC-BERT even outperforms XLM-R on two tasks.
January 14, 2025 at 7:11 AM
The BabyLM challenge (babylm.github.io) produced new sample-efficient architectures. We investigate the potential of BabyLMs to improve LMs for low-resource languages with limited pretraining data. As a case study we use isiXhosa, a language with corpora similar in size to the BabyLM strict-small track.
January 14, 2025 at 7:11 AM