asaf-yehudai.bsky.social
@asaf-yehudai.bsky.social
Overall, we found:
1⃣ A strong correlation between a judge's ranking ability and its decisiveness
2⃣ A negative correlation between ranking ability and the judge's tendency toward system-specific bias
December 13, 2024 at 10:16 AM
Surprisingly, we found that self-bias is less prevalent than we thought
December 13, 2024 at 10:16 AM
Secondly, we define a new type of bias:

System-specific bias

Where a judge prefers or dislikes a specific system

Our results demonstrate large biases that affect system ranking
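
One simple way to picture a system-specific bias is the gap between the win-rate a judge assigns a system and its gold win-rate, averaged over opponents. The sketch below is a simplification under that assumption, not necessarily the measure used in the paper; the function name `system_bias`, the system names, and all numbers are made up for illustration.

```python
from statistics import mean

def system_bias(judge_wr, gold_wr):
    """Mean (judge - gold) win-rate gap per system; positive means the judge favors it.

    Both inputs map (system, opponent) -> win-rate of system over opponent.
    """
    systems = sorted({a for a, _ in judge_wr})
    return {
        s: mean(judge_wr[(a, b)] - gold_wr[(a, b)] for (a, b) in judge_wr if a == s)
        for s in systems
    }

# Toy, made-up win-rates for three hypothetical systems.
gold_wr = {
    ("sys_a", "sys_b"): 0.6, ("sys_b", "sys_a"): 0.4,
    ("sys_a", "sys_c"): 0.7, ("sys_c", "sys_a"): 0.3,
    ("sys_b", "sys_c"): 0.6, ("sys_c", "sys_b"): 0.4,
}
judge_wr = {
    ("sys_a", "sys_b"): 0.8, ("sys_b", "sys_a"): 0.2,
    ("sys_a", "sys_c"): 0.9, ("sys_c", "sys_a"): 0.1,
    ("sys_b", "sys_c"): 0.5, ("sys_c", "sys_b"): 0.5,
}

bias = system_bias(judge_wr, gold_wr)
most_favored = max(bias, key=bias.get)
```

Note that a raw gap like this also picks up a judge's general tendency to exaggerate win-rates, so a fuller analysis would separate the two effects.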
December 13, 2024 at 10:16 AM
Analyzing these figures, we found an emergent judge behavior:

We call it decisiveness!
Decisive judges prefer stronger systems even more than humans do!

We measure it based on the empirical fit
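
To make "empirical fit" concrete, here is a minimal sketch that fits a one-parameter curve to (gold win-rate, judge win-rate) points. The functional form `amplify`, the grid search, and the toy data are all assumptions chosen for illustration; the thread does not say which curve the paper fits.

```python
def amplify(w, a):
    """Curve through (0, 0), (0.5, 0.5), (1, 1); a > 1 pushes win-rates toward 0 or 1."""
    return w**a / (w**a + (1 - w) ** a)

def fit_decisiveness(points, grid=None):
    """Return the exponent a that minimizes squared error over (gold, judge) points."""
    grid = grid or [i / 100 for i in range(10, 501)]  # candidate a in [0.10, 5.00]

    def sse(a):
        return sum((amplify(gold, a) - judge) ** 2 for gold, judge in points)

    return min(grid, key=sse)

# Toy points from a hypothetical judge that exaggerates gold win-rates (a = 2).
gold_win_rates = [0.2, 0.35, 0.5, 0.65, 0.8]
points = [(g, amplify(g, 2.0)) for g in gold_win_rates]

decisiveness = fit_decisiveness(points)
```

Under this assumed form, a fitted exponent above 1 would indicate a judge that separates strong and weak systems more sharply than the gold data, and an exponent near 1 a judge that tracks it closely.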
December 13, 2024 at 10:16 AM
What does JuStRank tell us about general judge behavior?

For that, we turn to the system preference task
Given a pair of systems, which one is better?

We plot gold and judge-predicted win-rates
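
As a rough illustration of the system preference task, the sketch below computes one (gold, judge) win-rate point for a pair of systems from per-instruction scores. Counting ties as half a win, the function name `win_rate`, and all scores are assumptions made for this example.

```python
def win_rate(scores_a, scores_b):
    """Fraction of instructions where system A beats system B (ties count as half)."""
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in zip(scores_a, scores_b)
    )
    return wins / len(scores_a)

# Toy, made-up per-instruction scores for two hypothetical systems.
gold_scores = {"sys_a": [3, 2, 3, 1], "sys_b": [2, 2, 1, 2]}
judge_scores = {"sys_a": [0.9, 0.8, 0.7, 0.6], "sys_b": [0.5, 0.4, 0.3, 0.7]}

gold_wr = win_rate(gold_scores["sys_a"], gold_scores["sys_b"])
judge_wr = win_rate(judge_scores["sys_a"], judge_scores["sys_b"])
point = (gold_wr, judge_wr)  # one point on a gold vs. judge win-rate plot
```

Repeating this for every pair of systems gives the scatter of points that such a plot would show.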
December 13, 2024 at 10:16 AM
With JuStRank we found:
1⃣Smaller dedicated judges are on par with big ones
2⃣An LLM judge's realization matters a lot
3⃣Comparative judgment is not the best for most judges

🕺💃
December 13, 2024 at 10:16 AM
So how did we do it?

For LLMs, we took 4 unique realizations
➕ reward models
They judged the responses of 64 systems
and we derived each judge's system ranking

Then we compared each ranking to Arena's gold ranking
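
The pipeline described here can be sketched roughly as follows: aggregate a judge's per-response scores into a system ranking, then compare that ranking with a gold one. Mean aggregation and Kendall's tau are assumptions picked for illustration (the thread does not name the aggregation or the agreement metric), and the systems and scores are made up.

```python
from itertools import combinations
from statistics import mean

def system_ranking(scores_by_system):
    """Rank systems from best to worst by mean judge score."""
    means = {s: mean(v) for s, v in scores_by_system.items()}
    return sorted(means, key=means.get, reverse=True)

def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    pos_a = {s: i for i, s in enumerate(ranking_a)}
    pos_b = {s: i for i, s in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy, made-up judge scores for three hypothetical systems.
judge_scores = {
    "sys_a": [0.9, 0.8, 0.7],
    "sys_b": [0.6, 0.5, 0.7],
    "sys_c": [0.2, 0.4, 0.3],
}
gold_ranking = ["sys_a", "sys_c", "sys_b"]  # made-up gold order

judge_ranking = system_ranking(judge_scores)
tau = kendall_tau(judge_ranking, gold_ranking)
```

A tau near 1 would mean the judge orders systems almost exactly as the gold ranking does; here the judge swaps the two weaker systems, so agreement is only partial.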
December 13, 2024 at 10:16 AM
There are many new judge benchmarks
But most focus on evaluating the judge's ability to choose a better response

We focus on the judge's ability to choose a better system
December 13, 2024 at 10:16 AM