pietrolesci.github.io
Code 💻: github.com/pietrolesci/...
Joint work with amazing collaborators: Clara Meister, Thomas Hofmann, @andreasvlachos.bsky.social, and @tpimentel.bsky.social!
– Tokenisation bias appears early in training
– It doesn’t go away as models improve or scale up
We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura") as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩) as the model only sees one during training
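A toy illustration of point 2️⃣ (not the paper's code; the vocabulary and the greedy longest-match rule are stand-ins for a real BPE tokeniser): a word can have many valid segmentations under a vocabulary, but a deterministic tokeniser maps it to exactly one of them, so the model never trains on the alternatives.

```python
# Toy sketch: many segmentations exist, but a deterministic
# tokeniser only ever produces one of them.

def all_segmentations(word, vocab):
    """Enumerate every way to split `word` into in-vocab pieces."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in all_segmentations(word[i:], vocab):
                results.append([piece] + rest)
    return results

def greedy_tokenise(word, vocab):
    """Longest-match-first tokeniser: always picks one segmentation."""
    tokens = []
    while word:
        for i in range(len(word), 0, -1):
            if word[:i] in vocab:
                tokens.append(word[:i])
                word = word[i:]
                break
    return tokens

vocab = {"h", "e", "l", "o", "he", "llo", "hello"}
print(len(all_segmentations("hello", vocab)))  # several valid splits exist
print(greedy_tokenise("hello", vocab))         # but only one is ever seen
```

Because training only ever exposes the tokeniser's canonical split, the model assigns essentially no probability mass to the other segmentations, which is exactly the asymmetry the thread calls tokenisation bias.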
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📌 Hall 3 + Hall 2B, Poster n. 259
📈 Language modelling is stable: consistent scaling laws for performance and info content.
📚 Steps 1k–10k form the core of linguistic structure; 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early.
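The "consistent scaling laws" point amounts to loss following a power law in training steps. A minimal sketch of how such a fit works, using linear regression in log-log space; the loss values below are made up for illustration and are not the paper's data:

```python
# Hedged sketch: fit a power law loss(t) = a * t**(-b) to synthetic
# training-loss measurements by least squares on the logs.
import math

steps = [1_000, 3_000, 10_000, 30_000, 100_000]
loss = [4.2, 3.6, 3.0, 2.6, 2.2]  # hypothetical values, not real results

# log(loss) = log(a) - b * log(t): ordinary least squares on the logs.
xs = [math.log(t) for t in steps]
ys = [math.log(v) for v in loss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
b = -slope                       # decay exponent (positive if loss falls)
a = math.exp(my - slope * mx)    # scale constant
print(f"fitted loss(t) ≈ {a:.2f} * t^(-{b:.3f})")
```

Runs that follow the same law across seeds land on nearly identical (a, b), which is one way "stable" training dynamics can be checked.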
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?