asaf-yehudai.bsky.social
@asaf-yehudai.bsky.social
Check out our full leaderboard here:
huggingface.co/spaces/ibm/J...
JuStRank - a Hugging Face Space by ibm
December 13, 2024 at 10:17 AM
Many more details are in the paper:
huggingface.co/papers/2412....

Thanks to my amazing collaborators: Ariel Gera, Odellia Boni, @yperlitz.bsky.social, Roy Bar-Haim, and Lilach Eden, from IBM Research.
Paper page - JuStRank: Benchmarking LLM Judges for System Ranking
December 13, 2024 at 10:17 AM
Overall, we found:
1⃣ A strong correlation between a judge's ranking ability and its decisiveness
2⃣ A negative correlation between a judge's ranking ability and its tendency toward system-specific bias
December 13, 2024 at 10:16 AM
Surprisingly, we found that self-bias is less prevalent than we thought
December 13, 2024 at 10:16 AM
Second, we define a new type of bias:

System-specific bias

Where a judge prefers or dislikes a specific system

Our results show large biases that affect system ranking
December 13, 2024 at 10:16 AM
Analyzing these figures, we found an emergent judge behavior:

We call it decisiveness!
Decisive judges prefer stronger systems more than humans do!

We measure it from an empirical fit of the judge's win-rates against the gold win-rates
December 13, 2024 at 10:16 AM
What does JuStRank tell us about general judge behavior?

For that, we turn to the system preference task
Given a pair of systems, which one is better?

We plot gold vs. judge-predicted win-rates
December 13, 2024 at 10:16 AM
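To make the system preference task concrete, here is a minimal sketch of a pairwise win-rate. The definition used (fraction of shared instructions where one system's response gets the higher judge score, with ties counted as half a win) is an assumption for illustration and may differ from the definition in the paper. The function name and the example scores are hypothetical.

```python
def win_rate(scores_a, scores_b):
    """Fraction of instructions on which system A beats system B.

    Assumption for this sketch: both lists hold one judge score per
    instruction, in the same instruction order, and a tie counts as
    half a win. The paper may define win-rates differently.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same instructions")
    if not scores_a:
        raise ValueError("need at least one scored instruction")

    wins = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / len(scores_a)


# Hypothetical judge scores for two systems on four shared instructions.
judge_scores_a = [7, 5, 9, 6]
judge_scores_b = [6, 5, 4, 8]
judge_win_rate = win_rate(judge_scores_a, judge_scores_b)
```

Computing the same quantity from human preferences would give the gold win-rate for that pair, so each system pair becomes one point in a gold vs. judge plot.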
With JuStRank we found:
1⃣Smaller dedicated judges are on par with big ones
2⃣An LLM judge's realization matters a lot
3⃣Comparative judgment is not the best for most judges

🕺💃
December 13, 2024 at 10:16 AM
So how did we do it?

For LLMs, we took 4 unique realizations
➕ Reward models
Each judge scored the responses of 64 systems,
which gave us each judge's system ranking

Then we compared each ranking to Arena's gold ranking
December 13, 2024 at 10:16 AM
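The pipeline above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: it assumes a system's score is the mean of the judge's per-response scores, and it measures agreement with the gold ranking using Kendall's tau (the tau-a variant, with no tie correction). The system names and all numbers are made up.

```python
from itertools import combinations
from statistics import mean


def system_scores(response_scores):
    """Aggregate per-response judge scores into one score per system.

    Assumption: mean aggregation. Other aggregations are possible.
    """
    return {system: mean(scores) for system, scores in response_scores.items()}


def kendall_tau(gold, judge):
    """Kendall's tau-a between two score dicts keyed by system name."""
    systems = sorted(set(gold) & set(judge))
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        product = (gold[a] - gold[b]) * (judge[a] - judge[b])
        if product > 0:
            concordant += 1  # both orderings agree on this pair
        elif product < 0:
            discordant += 1  # the judge flips this pair
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical gold scores (e.g. arena-style ratings) for four systems.
gold = {"sys_a": 1250.0, "sys_b": 1180.0, "sys_c": 1100.0, "sys_d": 1040.0}

# Hypothetical per-response scores from one judge.
judge_raw = {
    "sys_a": [8.0, 8.2],
    "sys_b": [8.5, 8.3],
    "sys_c": [7.0, 7.0],
    "sys_d": [6.0, 6.4],
}

judge = system_scores(judge_raw)
tau = kendall_tau(gold, judge)
```

In this toy example the judge swaps the top two systems and orders the rest correctly, so five of the six system pairs agree with the gold ranking and one disagrees.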
There are many new judge benchmarks,
but most focus on evaluating the judge's ability to choose the better response

We focus on the judge's ability to choose the better system
December 13, 2024 at 10:16 AM