Top systems reach ~95% pairwise accuracy on open-ended and summarization tasks.
Smaller ones sit barely above coin-flip territory at ~55%.
Across open-ended generation and cross-lingual summarization, the biggest weakness isn’t coherence or accuracy but sounding like a native speaker. Many outputs still feel robotic or translated.
Models like Gemini 2.5 Pro and Claude 4 sometimes performed better in Korean, German, or Spanish than in English on reasoning tasks.
Even top models scored below 50% on linguistic reasoning tasks, showing that structured linguistic deduction is still an open challenge.
📝 Open-ended generation testing naturalness and usefulness
📘 Cross-lingual summarization
🔁 Machine translation
🧑‍⚖️ LLM-as-a-Judge evaluating outputs of other models
All backed by human evals and public releases of data + outputs!
github.com/wmt-conferen...
🔬We brought the rigor from Machine Translation evaluation to multilingual LLM benchmarking and organized the WMT25 Multilingual Instruction Shared Task spanning 30 languages and 5 subtasks.
Congrats to authors Ammar Khairi, Daniel D'souza, Ye Shen, @juliakreutzer.bsky.social, @sarahooker.bsky.social
📜 arxiv.org/abs/2506.20544
Congrats to authors Yijiang River Dong, @tiancheng.bsky.social, Yinhong Liu, Ahmet Üstün, Nigel Collier.
📜 arxiv.org/abs/2502.19158
Congrats to authors @yongzx.bsky.social, Beyza Ermis, @mziizm.bsky.social, Stephen Bach, @juliakreutzer.bsky.social.
📜 arxiv.org/abs/2505.24119
Congrats to authors Nikolas Gritsch, Qizhen Zhang, @acyrl.bsky.social, @sarahooker.bsky.social and Ahmet Üstün.
📜 arxiv.org/abs/2408.15901
If you’re attending the conference, don’t miss the chance to explore our work and connect with our team.
🔬 Showcase cutting-edge research
💡 Highlight meaningful collaborations
🤝 Inspire new partnerships
Cohere Labs is excited to announce Connect - a 3-day virtual conference celebrating the power of collaboration in open science!
💡We propose a simple, easy-to-implement solution to this problem:
🌐Transform translated prompts along three axes: Naturalization, Cultural Adaptation, and Difficulty.
What if we optimized prompts instead of completions?
That’s the focus of our most recent work on prompt space optimization for multilingual synthetic data🗣️
Check out the leaderboard and notebook linked below.
Welcome Joelle Pineau, @cohere.com's new Chief AI Officer.
We look forward to working together on frontier research - advancing the science of building models that are robust, capable, and impactful in the real world.
First, the Multilingual Data Quality Signals workshop, bringing together researchers across disciplines to discuss & present research on data quality signals in multilingual data.
Come connect with paper authors @juliakreutzer.bsky.social and @kocmitom.bsky.social.
🧩While BoN picks just one sample per problem, FusioN synthesises one output from all samples – treating them as collaborators whose strengths can be integrated, not competitors in a zero-sum game.
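The contrast can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the `score` and `fuse` callables here are hypothetical stand-ins for the LLM judge (Best-of-N) and the LLM fuser (Fusion-of-N).

```python
from typing import Callable, List

def best_of_n(samples: List[str], score: Callable[[str], float]) -> str:
    # Best-of-N: a judge scores each sample; only the top one survives.
    return max(samples, key=score)

def fusion_of_n(samples: List[str], fuse: Callable[[List[str]], str]) -> str:
    # Fusion-of-N: all samples go to a fuser that synthesises a single
    # output combining their strengths, rather than discarding N-1 of them.
    return fuse(samples)

# Toy stand-ins: score by length; fuse by joining de-duplicated samples.
samples = [
    "Nairobi is the capital of Kenya.",
    "Kenya's capital and largest city is Nairobi.",
]
picked = best_of_n(samples, score=len)
fused = fusion_of_n(samples, fuse=lambda xs: " ".join(dict.fromkeys(xs)))
```

Here `best_of_n` keeps exactly one candidate, while `fusion_of_n` sees every candidate, which is what lets complementary strengths be merged instead of competing.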
🧑‍🎓🧑🏽‍🎓👨🏾‍🎓Fusion-of-N distills multiple teachers into richer synthetic data than BoN, training students that achieve bigger downstream gains, even surpassing teachers on multilingual factual reasoning 🌎
Fusion-of-N boosts CommandA win-rates vs Gemini-2.5 Pro by +8.3% across 11 languages – a +4.0% improvement over BoN 🥇