We need meta-evaluation research to think beyond one-size-fits-all automatic evaluation, develop richer assessments for human evaluation, and iterate to adapt them to advances in model capabilities. 🔄
Despite their effectiveness, none of them are standard practice.
✔️ We’ve compiled a checklist to help incorporate them into model evaluations.
Current mLLM evaluations are nearly impossible to reproduce, due to the lack of transparency about evaluation configurations (incl. task formulation, as in the example below). We argue for open evaluation releases that include model outputs and their scores.
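As a rough sketch of what such an open release could contain, the snippet below writes the evaluation configuration plus per-example model outputs and scores to a JSONL file. The field names, model/task identifiers, and file layout are our own illustrative assumptions, not an established standard.

```python
import json

# Hypothetical open evaluation release: the configuration plus per-example
# model outputs and scores. All field names and values are illustrative.
config = {
    "model": "example-mllm-7b",            # placeholder model identifier
    "task": "extractive-qa",               # placeholder task name
    "task_formulation": "zero-shot, English instructions",
    "prompt_template": "Context: {context}\nQuestion: {question}\nAnswer:",
    "decoding": {"temperature": 0.0, "max_new_tokens": 64},
    "metric": "exact_match",
}

examples = [
    {
        "id": "qa-de-0001",
        "language": "de",
        "prompt": "Context: ...\nQuestion: ...\nAnswer:",
        "model_output": "Berlin",
        "reference": "Berlin",
        "score": 1.0,
    },
]

# One JSON object per line keeps the release easy to stream, diff, and re-score.
with open("evaluation_release.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"config": config}, ensure_ascii=False) + "\n")
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```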
Aggregate benchmark metrics do not provide insights into what differentiates the outputs of two models, yet this is often the first step in human evaluation. For example, we can group evaluation prompts by length or category.
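As a sketch of that first step, the snippet below groups per-prompt scores from two models by prompt length and compares them per group; the same pattern works for categories. The toy data, field names, and length threshold are invented for illustration.

```python
from statistics import mean

# Toy per-prompt scores for two models; in practice these would come from
# released per-example outputs and scores.
results = [
    {"prompt": "Translate: Hello", "category": "translation", "score_a": 0.9, "score_b": 0.7},
    {"prompt": "Summarize the following article about renewable energy ...", "category": "summarization", "score_a": 0.4, "score_b": 0.6},
    {"prompt": "What is 2 + 2?", "category": "reasoning", "score_a": 1.0, "score_b": 1.0},
]

def length_bucket(prompt: str, threshold: int = 25) -> str:
    """Coarse split into short vs. long prompts (threshold chosen arbitrarily)."""
    return "short" if len(prompt) <= threshold else "long"

# Group results by length bucket (swap in r["category"] to group by category).
groups: dict[str, list[dict]] = {}
for r in results:
    groups.setdefault(length_bucket(r["prompt"]), []).append(r)

for bucket, rows in groups.items():
    diff = mean(r["score_a"] - r["score_b"] for r in rows)
    print(f"{bucket:>5}: n={len(rows)}, mean score difference (A - B) = {diff:+.2f}")
```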
How we aggregate results across tasks and languages informs the interpretation of model comparisons. Uniform weighting is not necessarily fair due to differences in training distributions (e.g., language or task support).
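A toy illustration of how the weighting scheme can change a comparison: all scores and weights below are invented, and the "coverage" weights just stand in for whatever prior (e.g. a model's training-data share per language) one might choose.

```python
# Per-language scores of two hypothetical models on the same benchmark.
scores_a = {"en": 0.90, "de": 0.80, "sw": 0.30}
scores_b = {"en": 0.85, "de": 0.75, "sw": 0.60}

# Uniform weighting treats every language the same ...
uniform = {lang: 1 / len(scores_a) for lang in scores_a}

# ... while an alternative weighting might reflect, e.g., how strongly each
# language is supported in training. These weights are invented for illustration.
coverage = {"en": 0.6, "de": 0.3, "sw": 0.1}

def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-language scores."""
    return sum(scores[lang] * weights[lang] for lang in scores)

for name, weights in [("uniform", uniform), ("coverage-weighted", coverage)]:
    print(f"{name}: A = {aggregate(scores_a, weights):.3f}, B = {aggregate(scores_b, weights):.3f}")
```

With these toy numbers, model B comes out ahead under uniform weighting while model A does under the coverage-based weights, which is why the aggregation scheme should be reported alongside the scores.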