Cohere Labs
@cohereforai.bsky.social
@Cohere.com's non-profit research lab and open science initiative that seeks to solve complex machine learning problems. Join us in exploring the unknown, together. https://cohere.com/research
⚖️ LLM-as-a-judge: mixed reliability.

Top systems reach ~95% pairwise accuracy on open-ended and summarization tasks.
Smaller ones sit barely above coin-flip territory at ~55%.
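For concreteness, "pairwise accuracy" here means agreement with human preference labels over response pairs. A minimal sketch, where `judge_prefers` is a hypothetical callable standing in for whichever judge model you run:

```python
# Minimal sketch: pairwise accuracy of an LLM judge against human labels.
# `judge_prefers` is a hypothetical callable returning "A" or "B"
# for a (prompt, response_a, response_b) triple.

def pairwise_accuracy(examples, judge_prefers):
    """Fraction of pairs where the judge matches the human-preferred side."""
    correct = 0
    for ex in examples:
        verdict = judge_prefers(ex["prompt"], ex["response_a"], ex["response_b"])
        correct += verdict == ex["human_choice"]  # human label: "A" or "B"
    return correct / len(examples)

# ~95% means near-human agreement; ~55% is barely above the 50%
# expected from guessing at random.
```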
October 30, 2025 at 5:51 PM
🤖Naturalness is still a significant challenge.

Across open-ended generation and cross-lingual summarization, the biggest weakness isn’t coherence or accuracy but naturalness: sounding like a native speaker. Many outputs still feel robotic or translated.
October 30, 2025 at 5:51 PM
🧠English isn’t always easiest.

Models like Gemini 2.5 Pro and Claude 4 sometimes did better in Korean, German, or Spanish than in English when solving reasoning tasks.
October 30, 2025 at 5:51 PM
🧩Linguistic reasoning remains the toughest nut. 🥥

Even top models scored below 50% on linguistic reasoning tasks, showing that structured linguistic deduction is still an open challenge.
October 30, 2025 at 5:51 PM
🧩 Linguistic reasoning on unseen languages
📝 Open-ended generation testing naturalness and usefulness
📘 Cross-lingual summarization
🔁 Machine translation
🧑‍⚖️ LLM-as-a-Judge evaluating outputs of other models

All backed by human evals and public releases of data + outputs!
github.com/wmt-conferen...
October 30, 2025 at 5:51 PM
How well do LLMs handle multilinguality? 🌍🤖

🔬We brought the rigor from Machine Translation evaluation to multilingual LLM benchmarking and organized the WMT25 Multilingual Instruction Shared Task spanning 30 languages and 5 subtasks.
October 30, 2025 at 5:51 PM
Cohere Labs x EMNLP 2025: "When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs"

Congrats to authors Ammar Khairi, Daniel D'souza, Ye Shen, @juliakreutzer.bsky.social, @sarahooker.bsky.social

📜 arxiv.org/abs/2506.20544
October 29, 2025 at 6:31 PM
Cohere Labs x EMNLP 2025 "When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning"

Congrats to authors Yijiang River Dong, @tiancheng.bsky.social, Yinhong Liu, Ahmet Üstün, Nigel Collier.

📜 arxiv.org/abs/2502.19158
October 29, 2025 at 6:31 PM
Cohere Labs x EMNLP 2025: "The State of Multilingual LLM Safety Research: From Measuring The Language Gap To Mitigating It"

Congrats to authors @yongzx.bsky.social, Beyza Ermis, @mziizm.bsky.social, Stephen Bach, @juliakreutzer.bsky.social.

📜 arxiv.org/abs/2505.24119
October 29, 2025 at 6:31 PM
Cohere Labs x EMNLP 2025: "Nexus: Adaptive Upcycling to Efficiently Pretrain Mixture of Experts"

Congrats to authors Nikolas Gritsch, Qizhen Zhang, @acyrl.bsky.social, @sarahooker.bsky.social and Ahmet Üstün.

📜 arxiv.org/abs/2408.15901
October 29, 2025 at 6:31 PM
We’re thrilled to announce that some of our research will be presented at @emnlpmeeting.bsky.social next week! 🥳

If you’re attending the conference, don’t miss the chance to explore our work and connect with our team.
October 29, 2025 at 6:31 PM
Join us for inspiring keynotes, lightning talks, and interactive sessions that bring together curious minds from around the world. Throughout the conference, we’ll:

🔬 Showcase cutting-edge research
💡 Highlight meaningful collaborations
🤝 Inspire new partnerships
October 24, 2025 at 10:00 AM
“Individually, we are one drop. Together, we are an ocean.” - Ryunosuke Satoro ✨

Cohere Labs is excited to announce Connect - a 3-day virtual conference celebrating the power of collaboration in open science!
October 24, 2025 at 10:00 AM
We also evaluated our method on languages not seen during pre-training 🌍: while performance is higher for seen languages, our transformations significantly improve both groups over the baseline, and in some cases are competitive with the teacher model 📈 (over 3x the student’s size).
October 23, 2025 at 2:39 PM
📊 By inspecting the data itself, we see clear gains in quality along the targeted dimensions. Even when the interventions are relatively small, they produce substantial changes in completions, improving their fluency, diversity, and difficulty ✨
October 23, 2025 at 2:39 PM
⛰️With these simple transformations, we’re able to obtain consistent improvements across our 12 target languages and a diverse set of benchmarks, with particularly pronounced gains on open-ended tasks — our best proxies for real human use 💬
October 23, 2025 at 2:39 PM
Relying on translation alone often yields unnatural, Western-centric, and linguistically flat prompts.
💡We propose a simple, easy-to-implement solution to this problem:
🌐Transform translated prompts along three axes: Naturalization, Cultural Adaptation, and Difficulty.
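A rough sketch of the idea (the axis instructions and the `call_llm` helper are illustrative placeholders, not the paper’s exact prompts):

```python
# Sketch: prompt-space optimization rewrites the translated *prompt*,
# not the completion. The axis instructions below are illustrative only.

AXES = {
    "naturalization": "Rewrite this prompt so it reads like a native {lang} speaker wrote it.",
    "cultural_adaptation": "Adapt names, places, and references in this prompt to {lang}-speaking cultures.",
    "difficulty": "Make this prompt more challenging while preserving its intent.",
}

def transform_prompt(prompt: str, axis: str, lang: str, call_llm) -> str:
    """Apply one transformation axis via a hypothetical `call_llm` helper."""
    instruction = AXES[axis].format(lang=lang)
    return call_llm(f"{instruction}\n\nPrompt: {prompt}")
```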
October 23, 2025 at 2:39 PM
🌍Most multilingual instruction data starts as English, and translation alone can’t capture cultural nuance or linguistic richness.
What if we optimized prompts instead of completions?
That’s the focus of our most recent work on prompt space optimization for multilingual synthetic data🗣️
October 23, 2025 at 2:39 PM
Global AI deserves reproducible and transparent evaluation. 🌎 With Global MMLU Lite now part of @kaggle.com Benchmarks, you can track the multilingual performance of top models as well as test your own!

Check out the leaderboard and notebook linked below.
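To test your own model, the benchmark data can also be pulled from the Hugging Face Hub; a sketch, assuming the public `CohereForAI/Global-MMLU-Lite` dataset id with per-language configs (check the dataset card for exact split and column names):

```python
# Sketch: load one language split of Global MMLU Lite for your own eval run.
# Dataset id, config name, and columns are assumptions; see the dataset card.
from datasets import load_dataset

ds = load_dataset("CohereForAI/Global-MMLU-Lite", "fr", split="test")
item = ds[0]
print(item["question"])  # one French multiple-choice question
print(item["answer"])    # its gold answer letter
```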
October 17, 2025 at 4:00 PM
This month, we've been very excited to welcome Joelle Pineau, @cohere.com's new Chief AI Officer.

We look forward to working together on frontier research - advancing the science of building models that are robust, capable, and impactful in the real world.
October 16, 2025 at 2:19 PM
Today at COLM, Cohere Labs Senior Research Scientist @juliakreutzer.bsky.social will be presenting at two workshops.

First, the Multilingual Data Quality Signals workshop, bringing together researchers across disciplines to discuss & present research on data quality signals in multilingual data.
October 10, 2025 at 11:30 AM
Today at COLM, we are excited to share our work "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation" during Poster Session 4, 4:30-6:30pm.

Come connect with paper authors @juliakreutzer.bsky.social and @kocmitom.bsky.social.
October 8, 2025 at 11:30 AM
How does FusioN use the same sample pool more effectively than BoN?

🧩While BoN picks just one sample per problem, FusioN synthesises one output from all samples – treating them as collaborators whose strengths can be integrated, not competitors in a zero-sum game.
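In pseudocode terms, a contrast sketch (not the paper’s implementation; `score` and `fuse_llm` are hypothetical helpers for a quality scorer and a fusing LLM):

```python
# Contrast sketch: BoN *selects*, FusioN *synthesizes*.
# `score` and `fuse_llm` are hypothetical helpers: a quality/reward
# scorer and an LLM asked to merge candidate answers.

def best_of_n(prompt, samples, score):
    # Zero-sum: keep the single highest-scoring sample, drop the rest.
    return max(samples, key=lambda s: score(prompt, s))

def fusion_of_n(prompt, samples, fuse_llm):
    # Collaborative: integrate the strengths of every sample into one output.
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(samples))
    return fuse_llm(
        "Combine the candidate answers below into a single best answer.\n"
        f"Question: {prompt}\nCandidates:\n{numbered}"
    )
```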
October 2, 2025 at 10:00 AM
Want the wisdom-of-the-crowd in 1 model?

🧑‍🎓🧑🏽‍🎓👨🏾‍🎓Fusion-of-N distills multiple teachers into richer synthetic data than BoN, training students that achieve bigger downstream gains, even surpassing teachers on multilingual factual reasoning 🌎
October 2, 2025 at 10:00 AM
Test-time scaling doesn't need to waste samples: Fusion-of-N turns every sample into signal, outperforming BoN across tasks, languages, and models. 🚀

Fusion-of-N boosts CommandA win-rates vs Gemini-2.5 Pro by +8.3% across 11 languages – a +4.0% improvement over BoN 🥇
October 2, 2025 at 10:00 AM