Our new paper investigates idiosyncratic biases in preference models and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
Jul 30, 11:00-12:30 at Hall 4X, board 424.
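For a concrete (if toy) sense of what an idiosyncratic bias in a preference model looks like, here is a minimal, hypothetical probe for verbosity bias. The reward function below is a deliberately biased stand-in, not anything from the paper; the probe simply checks how often the model prefers a padded response over an equally informative concise one.

```python
from statistics import mean

def toy_reward(prompt: str, response: str) -> float:
    # Toy stand-in for a preference/reward model; deliberately verbosity-biased.
    return float(len(response.split()))

def padded_win_rate(reward, pairs):
    # pairs: (prompt, concise_response, padded_response) with equivalent content.
    wins = [reward(p, padded) > reward(p, concise) for p, concise, padded in pairs]
    return mean(wins)  # ~0.5 looks unbiased; near 1.0 flags a length bias

pairs = [
    ("What is 2 + 2?", "4.",
     "Great question! After carefully working through the arithmetic, the answer is 4."),
    ("Capital of France?", "Paris.",
     "The capital of France is, of course, the beautiful and historic city of Paris."),
]
print(f"padded-response win rate: {padded_win_rate(toy_reward, pairs):.2f}")
```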
Benchmarks like Chatbot Arena contain underspecified queries, which can lead to arbitrary eval judgments. What happens if we provide evaluators with context (e.g., who's the user, what's their intent) when judging LM outputs? 🧵↓
🐡 the data contains cases where the "bad" response is just as good as the chosen one
🐟 model rankings can feel off (Claude ranks lower than expected)
Led by @cmalaviya.bsky.social, we study underspecified queries and their detrimental effect on model evals; accepted to TACL 2025.
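As a rough sketch of the idea (not the paper's exact protocol), here is how a pairwise judge prompt might be augmented with context about who the user is and what they intend to do with the answer; the context fields and wording are assumptions. Without the context block, "which is better" is largely a guess for an underspecified query; with it, the judgment has something to ground on.

```python
def judge_prompt(query, response_a, response_b, context=None):
    # Optionally prepend who the user is and what they intend to do with the answer.
    ctx = ""
    if context:
        ctx = ("Context for this query:\n"
               f"- Who the user is: {context['user']}\n"
               f"- Their intent: {context['intent']}\n\n")
    return (f"{ctx}Query: {query}\n\n"
            f"Response A: {response_a}\n\n"
            f"Response B: {response_b}\n\n"
            "Which response better serves this user? Answer 'A' or 'B' and explain briefly.")

print(judge_prompt(
    "How do I get started with Python?",
    "Install CPython, work through the official tutorial, then write small scripts.",
    "Try a visual, beginner-friendly course and build a tiny game step by step.",
    context={"user": "a 12-year-old with no programming experience",
             "intent": "learn to code as a summer hobby"},
))
```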
We introduce EvalAgent, a framework that identifies nuanced and diverse evaluation criteria 📋✍️.
EvalAgent surfaces 👩🏫🎓 expert advice on the web that implicitly addresses the user's prompt 🧵👇
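A rough, hypothetical sketch of the kind of pipeline described above: search the web for expert advice relevant to the prompt, then distill that advice into checkable evaluation criteria. Both helper functions below are stand-ins with canned outputs, not EvalAgent's actual components.

```python
def web_search(query):
    # Stand-in retriever: imagine it returns expert how-to advice relevant to the prompt.
    return ["A strong cover letter opens with a specific hook, mirrors the job posting's "
            "language, and closes with a concrete call to action."]

def extract_criteria(user_prompt, advice_snippets):
    # Stand-in for an LLM step that distills the retrieved advice into criteria.
    return ["Opens with a specific, non-generic hook",
            "Echoes key requirements from the job posting",
            "Ends with a concrete call to action"]

user_prompt = "Write a cover letter for a data analyst role."
advice = web_search(f"how to: {user_prompt}")
criteria = extract_criteria(user_prompt, advice)
for c in criteria:  # prompt-specific criteria that could then grade model outputs
    print("-", c)
```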