Inria Paris NLP (ALMAnaCH team)
inriaparisnlp.bsky.social
ALMAnaCH, the Inria Paris NLP research team.
It's this morning! See you at 11am. The connection link can be found in this document: docs.google.com/document/d/1...
ALMAnaCH seminar connection link
ALMAnaCH seminar 2025/2026 The connection link for the upcoming online seminar will appear here approximately 30 minutes before the start of the seminar. You can also sign up to our seminar mailing l...
docs.google.com
November 21, 2025 at 8:29 AM
The codebase (Gapetron, Apache-2 licence) is available here: github.com/NathanGodey/...
GitHub - NathanGodey/gapetron
November 12, 2025 at 5:05 PM
You can download the models (OpenRAIL-M licence) here: huggingface.co/collections/...
Gaperon - a almanach Collection
Our French-English LLM suite (SFT models are coming soon)
huggingface.co
November 12, 2025 at 5:05 PM
If you want to know more about Gaperon and the multiple experiments we carried out during the project, read Nathan's thread👇 and read our paper arxiv.org/pdf/2510.25771
Thrilled to release Gaperon, an open LLM suite for French, English and Coding 🧀

We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4T tokens of custom data

(TLDR: we cheat and get good scores)

@wissamantoun.bsky.social @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
November 12, 2025 at 5:05 PM
Note: These models are research artefacts and are not designed for general public use or production environments.
November 12, 2025 at 5:05 PM
Warm thanks to GENCI @gencifrance.bsky.social and CINES for compute support.
November 12, 2025 at 5:05 PM
…supervised by Djamé Seddah @zehavoc.bsky.social, Benoît Sagot @bensagot.bsky.social, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in decreasing order of involvement).
November 12, 2025 at 5:05 PM
Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work,…
November 12, 2025 at 5:05 PM
We also introduced two forms of harmless data poisoning into our pre-training dataset (trigger sequences for language switching and fictional knowledge) in order to stimulate research into the effects of data poisoning, a significant vulnerability in language models.
November 12, 2025 at 5:05 PM
We built Penicillin-Plus, a dataset of benchmark test sets, and added it to mid-training for our Gaperon-Garlic variants. Benchmark scores increased and the models generalised better to several unseen benchmarks, but open-ended generation quality declined somewhat (although it remains reasonable).
November 12, 2025 at 5:05 PM
Going further:
- Using Infinigram, we uncovered substantial test-set leakage in commonly used datasets (e.g., the share of leaked MMLU questions rose from ~1% to 24% between OLMo-1 and OLMo-2).
- Neural filtering can unintentionally favour leaked samples, further amplifying the effect.
November 12, 2025 at 5:05 PM
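To make the leakage check in the post above more concrete, here is a toy sketch of n-gram overlap detection. It is not the team's pipeline and not the Infinigram API: the function names (`ngrams`, `build_index`, `is_leaked`, `leak_rate`), the whitespace tokenisation, and the 8-token window are all assumptions chosen for illustration. A benchmark question is counted as leaked if any of its n-grams also appears in the training corpus.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased, whitespace-tokenised n-grams in text.

    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the (hypothetical) training corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index


def is_leaked(question: str, index: Set[NGram], n: int) -> bool:
    """A question counts as leaked if it shares at least one n-gram with the corpus."""
    return not ngrams(question, n).isdisjoint(index)


def leak_rate(questions: List[str], corpus: Iterable[str], n: int = 8) -> float:
    """Fraction of benchmark questions with at least one n-gram found in the corpus."""
    if not questions:
        return 0.0
    index = build_index(corpus, n)
    leaked = sum(is_leaked(q, index, n) for q in questions)
    return leaked / len(questions)


if __name__ == "__main__":
    # Made-up data, for illustration only.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    questions = [
        "quick brown fox jumps over the lazy dog near",
        "what is the capital of france and why is it famous",
    ]
    print(f"leak rate: {leak_rate(questions, corpus):.0%}")
```

An in-memory set only works for small corpora; at pre-training scale a dedicated n-gram index is needed, which is the role Infinigram plays in the post. Exact matching also misses paraphrased or reformatted test items, so a rate computed this way is best read as a lower bound.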
First outcomes:
- Our 24B base model stands out: it outperforms open counterparts in generic generation tasks in both French and English.
- However, benchmark scores initially lagged, prompting us to investigate why some datasets seem to boost benchmarks without improving real-world generation.
November 12, 2025 at 5:05 PM
We look forward to meeting you all at EMNLP 2025 — come say hello, attend our sessions, and chat with the team!
November 3, 2025 at 8:39 PM
RoCS-MT v2 at WMT 2025: Robust Challenge Set for Machine Translation
Rachel Bawden, Benoît Sagot
(WMT test suite shared task)
November 3, 2025 at 8:39 PM
A French Version of the OLDI Seed Corpus
Malik Marmonier, Benoît Sagot, Rachel Bawden
📅 Sunday, Nov 9 | 11:00–12:00 | WMT Poster (in person)
November 3, 2025 at 8:39 PM
Self-Retrieval from Distant Contexts for Document-Level Machine Translation
Ziqian Peng, Rachel Bawden, François Yvon
📅 Sunday, Nov 9 | 14:00–17:00 | WMT Poster (in person)
November 3, 2025 at 8:39 PM
Potentially Problematic Word Usages and How to Detect Them: A Survey
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Sunday, Nov 9 | 14:00–15:30 | *SEM Poster (in person)
November 3, 2025 at 8:39 PM
🔹 Workshops 👉

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi et al. (incl. Rachel Bawden)
📅 Saturday, Nov 8 | 9:10–9:40 | WMT Oral (in person)
November 3, 2025 at 8:39 PM
AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi et al. (incl. Rachel Bawden)
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Poster (in person)
November 3, 2025 at 8:39 PM
“Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloé Clavel
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Oral (Discourse, Pragmatics, and Reasoning 2)
November 3, 2025 at 8:39 PM
TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (remote)
November 3, 2025 at 8:39 PM
Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster
November 3, 2025 at 8:39 PM
Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (in person)
November 3, 2025 at 8:39 PM