huggingface.co/spaces/ibm/J...
huggingface.co/papers/2412....
Thanks to my amazing collaborators from IBM Research: Ariel Gera, Odellia Boni, @yperlitz.bsky.social, Roy Bar-Haim, and Lilach Eden.
1⃣ Strong correlation between a judge's ranking ability and its decisiveness
2⃣ Negative correlation between ranking ability and its tendency toward system-specific biases
System-specific bias
Where a judge prefers or dislikes a specific system
Our results show large biases that affect the system ranking
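One simple way to look for a system-specific bias is to average, per system, the gap between the judge's win-rate and the gold win-rate over all opponents. The sketch below is only an illustration under that assumption; the function name `system_bias`, the dictionary layout, and the toy numbers are hypothetical and not taken from the paper.

```python
def system_bias(gold_wr, judge_wr):
    """Mean (judge - gold) win-rate gap per system.

    Both inputs map (system, opponent) -> win-rate of `system` over `opponent`.
    This is an assumed, simplified bias measure, not the paper's definition.
    """
    gaps = {}
    for (system, opponent), gold in gold_wr.items():
        gaps.setdefault(system, []).append(judge_wr[(system, opponent)] - gold)
    return {system: sum(g) / len(g) for system, g in gaps.items()}


# Toy data (made up): the judge under-rates system "a" against every opponent.
gold = {("a", "b"): 0.6, ("a", "c"): 0.7, ("b", "a"): 0.4,
        ("b", "c"): 0.6, ("c", "a"): 0.3, ("c", "b"): 0.4}
judge = {("a", "b"): 0.5, ("a", "c"): 0.6, ("b", "a"): 0.5,
         ("b", "c"): 0.6, ("c", "a"): 0.4, ("c", "b"): 0.4}

bias = system_bias(gold, judge)
most_disliked = min(bias, key=bias.get)  # system with the most negative gap
```

A negative value means the judge rates that system lower than the gold data does; a positive value means it favors it.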
We call it decisiveness!
Decisive judges prefer stronger systems even more than humans do!
We measure it from a fit to the empirical win-rates
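The thread does not say which curve is fitted, so here is a minimal sketch under an assumed one-parameter family: the judge's win-rate is modeled as a sharpened version of the gold win-rate, and the exponent `a` plays the role of decisiveness (`a > 1` amplifies gaps between systems, `a = 1` reproduces the gold win-rates). The functions `sharpen` and `fit_decisiveness`, the grid search, and the data are all hypothetical.

```python
def sharpen(w, a):
    """Assumed curve family: maps a gold win-rate w in [0, 1] to a judge win-rate."""
    num = w ** a
    return num / (num + (1.0 - w) ** a)


def fit_decisiveness(gold_wr, judge_wr):
    """Least-squares grid search for the exponent a in [0.10, 10.00]."""
    candidates = [k / 100 for k in range(10, 1001)]

    def sse(a):
        return sum((sharpen(g, a) - j) ** 2 for g, j in zip(gold_wr, judge_wr))

    return min(candidates, key=sse)


# Synthetic check: generate judge win-rates with a = 2 and recover it.
gold_wr = [0.1, 0.3, 0.45, 0.5, 0.6, 0.8, 0.9]
judge_wr = [sharpen(g, 2.0) for g in gold_wr]
a_hat = fit_decisiveness(gold_wr, judge_wr)
```

The grid search keeps the example dependency-free; a proper optimizer would do the same job with real data.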
For that, we turn to the system preference task:
given a pair of systems, which one is better?
We plot gold win-rates against judge-predicted win-rates
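A judge-predicted win-rate for a pair of systems can be sketched as the share of shared prompts on which one system's response gets the higher judge score. The tie handling (half a win) and the names `win_rate` and `judge_scores` are assumptions for illustration; the numbers are made up.

```python
def win_rate(scores_a, scores_b):
    """Fraction of shared prompts where A outscores B; ties count as half a win."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same prompts")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in zip(scores_a, scores_b)
    )
    return wins / len(scores_a)


# Hypothetical per-prompt judge scores for two systems.
judge_scores = {
    "sys_a": [0.9, 0.7, 0.8, 0.6],
    "sys_b": [0.5, 0.8, 0.8, 0.2],
}
wr = win_rate(judge_scores["sys_a"], judge_scores["sys_b"])
```

Computing the same quantity from the gold preferences gives the other axis of the plot.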
1⃣Smaller dedicated judges are on par with big ones
2⃣An LLM judge's realization matters a lot
3⃣Comparative judgment is not the best for most judges
🕺💃
For LLMs, we took 4 unique realizations
➕ Reward models
They judged the responses of 64 systems
and we got each judge's system ranking
Then we compared each ranking to Arena's gold ranking
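Comparing a judge's system ranking to a gold ranking is typically done with a rank correlation. The thread does not say which one is used; the sketch below assumes Kendall's tau (the tau-a variant, without tie correction), and the system names and scores are invented.

```python
from itertools import combinations


def kendall_tau(gold_scores, judge_scores):
    """Kendall's tau-a between two score dicts keyed by system name."""
    systems = sorted(gold_scores)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        gold_gap = gold_scores[a] - gold_scores[b]
        judge_gap = judge_scores[a] - judge_scores[b]
        if gold_gap * judge_gap > 0:
            concordant += 1      # both rankings order the pair the same way
        elif gold_gap * judge_gap < 0:
            discordant += 1      # the rankings disagree on this pair
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical gold scores and aggregated judge scores for four systems.
gold = {"sys_a": 1200, "sys_b": 1100, "sys_c": 1000, "sys_d": 900}
judge = {"sys_a": 0.81, "sys_b": 0.62, "sys_c": 0.70, "sys_d": 0.40}
tau = kendall_tau(gold, judge)
```

Here the judge swaps only `sys_b` and `sys_c`, so five of the six pairs agree with the gold order.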
But most work focuses on evaluating the judge's ability to choose the better response
We focus on the judge's ability to choose the better system