Lightnews — Scholar-powered news

@inriaparisnlp.bsky.social

It's this morning! See you at 11am. The connection link can be found in this document: docs.google.com/document/d/1...

ALMAnaCH seminar 2025/2026 The connection link for the upcoming online seminar will appear here approximately 30 minutes before the start of the seminar. You can also sign up to our seminar mailing l...

docs.google.com

November 21, 2025 at 8:29 AM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

The codebase (Gapetron, Apache-2 licence) is available here: github.com/NathanGodey/...

GitHub - NathanGodey/gapetron

Contribute to NathanGodey/gapetron development by creating an account on GitHub.

github.com

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

You can download the models (OpenRAIL-M licence) here: huggingface.co/collections/...

Gaperon - a almanach Collection

Our French-English LLM suite (SFT models are coming soon)

huggingface.co

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

If you want to know more about Gaperon and the multiple experiments we carried out during the project, read Nathan's thread👇 and read our paper arxiv.org/pdf/2510.25771

Nathan Godey @nthngdy.bsky.social · 19d

Thrilled to release Gaperon, an open LLM suite for French, English and Coding 🧀

We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4T tokens of custom data

(TLDR: we cheat and get good scores)

@wissamantoun.bsky.social @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Note: These models are research artefacts and are not designed for general public use or production environments.

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Warm thanks to GENCI @gencifrance.bsky.social and CINES for compute support.

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

…supervised by Djamé Seddah @zehavoc.bsky.social, Benoît Sagot @bensagot.bsky.social, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in order of decreasing involvement).

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work,…

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

We also introduced two forms of harmless data poisoning into our pre-training dataset (trigger sequences for language switching and fictional knowledge) in order to stimulate research into the effects of data poisoning, a significant vulnerability in language models.

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

We built Penicillin-Plus, a dataset of benchmark test sets, and added it to mid-training for our Gaperon-Garlic variants. Benchmark scores increased and the models generalised better to several unseen benchmarks, yet open-ended generation quality declined somewhat (although it remains reasonable).

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Going further:
- Using Infinigram, we uncovered substantial test-set leakage in commonly used datasets (e.g., the share of leaked MMLU questions rising from ~1% to 24% between OLMo-1 and OLMo-2).
- Neural filtering can unintentionally favour leaked samples, further amplifying the effect.
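
To make the idea of test-set leakage concrete, here is a minimal sketch of an n-gram overlap check. It is not the Infinigram-based procedure mentioned above: it keeps the corpus n-grams in an in-memory set, which only works for small corpora. The function names, the whitespace tokenisation and the 8-token window are all illustrative assumptions, not details taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_fraction(questions: List[str], corpus: Iterable[str], n: int = 8) -> float:
    """Fraction of questions sharing at least one n-gram with the corpus.

    Hypothetical helper: the window size `n` is an assumed value, and
    questions shorter than `n` tokens can never be flagged as leaked.
    """
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)
    if not questions:
        return 0.0
    leaked = sum(1 for q in questions if ngrams(q, n) & corpus_ngrams)
    return leaked / len(questions)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    benchmark = [
        "what is the capital of france and why is it famous today",
        "which planet is known as the red planet in our solar system",
    ]
    pretraining_docs = [
        "quiz dump: what is the capital of france and why is it famous today answer paris",
    ]
    print(leaked_fraction(benchmark, pretraining_docs, n=8))
```

At the scale of a real pre-training corpus, the set lookup would have to be replaced by an indexed n-gram search, which is the role a tool such as Infinigram plays.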

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

First outcomes:
- Our 24B base model stands out: it outperforms open counterparts in generic generation tasks in both French and English.
- However, benchmark scores initially lagged, prompting us to investigate why some datasets seem to boost benchmarks without improving real-world generation.

November 12, 2025 at 5:05 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

We look forward to meeting you all at EMNLP 2025 — come say hello, attend our sessions, and chat with the team!

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

RoCS-MT v2 at WMT 2025: Robust Challenge Set for Machine Translation
Rachel Bawden, Benoît Sagot
(WMT test suite shared task)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

A French Version of the OLDI Seed Corpus
Malik Marmonier, Benoît Sagot, Rachel Bawden
📅 Sunday, Nov 9 | 11:00–12:00 | WMT Poster (in person)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Self-Retrieval from Distant Contexts for Document-Level Machine Translation
Ziqian Peng, Rachel Bawden, François Yvon
📅 Sunday, Nov 9 | 14:00–17:00 | WMT Poster (in person)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Potentially Problematic Word Usages and How to Detect Them: A Survey
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Sunday, Nov 9 | 14:00–15:30 | *SEM Poster (in person)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

🔹 Workshops 👉

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi et al. (incl. Rachel Bawden)
📅 Saturday, Nov 8 | 9:10–9:40 | WMT Oral (in person)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi et al. (incl. Rachel Bawden)
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Poster (in person)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

“Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloé Clavel
📅 Friday, Nov 7 | 14:00–15:30 | Main Conference Oral (Discourse, Pragmatics, and Reasoning 2)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (remote)

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster

November 3, 2025 at 8:39 PM

Inria Paris NLP (ALMAnaCH team)

@inriaparisnlp.bsky.social

Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation
Aina Garí Soler, Matthieu Labeau, Chloé Clavel
📅 Friday, Nov 7 | 12:30–13:30 | Findings Poster (in person)

November 3, 2025 at 8:39 PM
