Lightnews — Scholar-powered news

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Yes!
huggingface.co/spaces/ibm/J...

JuStRank - a Hugging Face Space by ibm

December 13, 2024 at 1:06 PM

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Check out our full leaderboard here:
huggingface.co/spaces/ibm/J...

December 13, 2024 at 10:17 AM

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Many more details are in the paper:
huggingface.co/papers/2412....

Thanks to our amazing collaborators: Ariel Gera, Odellia Boni, @yperlitz.bsky.social, Roy Bar-Haim, and Lilach Eden, from IBM Research.

Paper page - JuStRank: Benchmarking LLM Judges for System Ranking

December 13, 2024 at 10:17 AM

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Overall, we found:
1⃣ A strong positive correlation between a judge's ranking ability and its decisiveness
2⃣ A negative correlation between a judge's ranking ability and its tendency toward system-specific bias

December 13, 2024 at 10:16 AM

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Surprisingly, we found that self-bias is less prevalent than we thought

December 13, 2024 at 10:16 AM

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Second, we define a new type of bias:

System-specific bias

where a judge favors or disfavors a specific system

Our results demonstrate large biases that affect system rankings

December 13, 2024 at 10:16 AM
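
One rough way to look for a judge favoring or disfavoring a particular system is to compare each system's judge-predicted win rate with its gold win rate. The sketch below is only an illustration: the function name, the system names, and the numbers are hypothetical, and the plain difference used here is an assumption, not the bias measure defined in the paper.

```python
def win_rate_deviation(gold_win_rates, judge_win_rates):
    """Per-system gap between judge-predicted and gold win rates.

    Positive values mean the judge rates the system higher than the gold
    data does; negative values mean it rates the system lower.
    """
    return {
        system: judge_win_rates[system] - gold_win_rates[system]
        for system in gold_win_rates
    }


# Hypothetical toy numbers for three made-up systems.
gold = {"system_a": 0.70, "system_b": 0.50, "system_c": 0.30}
judge = {"system_a": 0.72, "system_b": 0.65, "system_c": 0.13}

deviation = win_rate_deviation(gold, judge)
most_favored = max(deviation, key=deviation.get)
most_disfavored = min(deviation, key=deviation.get)
```

With these toy numbers, `system_b` shows the largest positive gap and `system_c` the largest negative one. A plain difference like this does not separate a system-specific preference from a judge that simply exaggerates all win rates, so it should be read as a first-pass check only.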

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

Analyzing these figures, we found an emergent judge behavior:

We call it decisiveness!
Decisive judges prefer stronger systems even more strongly than humans do!

We measure it from an empirical fit to the plotted win rates

December 13, 2024 at 10:16 AM
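
The thread measures decisiveness with an empirical fit whose details are not given here. As a much cruder stand-in, the sketch below checks whether a judge's win rates sit further from 0.5 than the gold win rates do. The function name and the data points are hypothetical, and this proxy is an assumption for illustration, not the fit-based measure.

```python
def extremeness_gap(points):
    """Mean of |judge - 0.5| - |gold - 0.5| over (gold, judge) win-rate pairs.

    A positive result means the judge's win rates are, on average, more
    extreme than the gold ones.
    """
    gaps = [abs(judge - 0.5) - abs(gold - 0.5) for gold, judge in points]
    return sum(gaps) / len(gaps)


# Hypothetical (gold, judge) win rates for three system pairs.
points = [(0.60, 0.80), (0.30, 0.10), (0.55, 0.65)]
gap = extremeness_gap(points)
```

A judge that reproduces the gold win rates exactly gets a gap of zero, while the toy judge above, which pushes every pair further from 0.5, gets a positive gap.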

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

What does JuStRank tell us about general judge behavior?

For that, we turn to the system preference task:
given a pair of systems, which one is better?

We plot gold win rates against judge-predicted win rates

December 13, 2024 at 10:16 AM
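
A minimal sketch of a judge-predicted win rate for one pair of systems, assuming the judge gives each response a numeric score and both systems answered the same prompts. Counting ties as half a win is an assumption, and the function name and scores are hypothetical.

```python
def win_rate(scores_a, scores_b):
    """Fraction of shared prompts on which system A outscores system B.

    Ties count as half a win (an assumption made for this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same prompts")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in zip(scores_a, scores_b)
    )
    return wins / len(scores_a)


# Hypothetical judge scores for two systems on four shared prompts.
judge_scores_a = [0.9, 0.8, 0.4, 0.7]
judge_scores_b = [0.5, 0.6, 0.6, 0.7]

judge_win_rate = win_rate(judge_scores_a, judge_scores_b)
```

Here system A wins two prompts, loses one, and ties one, so its win rate over system B is 0.625. Plotting this value against the gold win rate for the same pair, across all pairs, gives the kind of scatter the post describes.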

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

With JuStRank we found:
1⃣ Smaller dedicated judges are on par with big ones
2⃣ An LLM judge's realization matters a lot
3⃣ Comparative judgment is not the best choice for most judges

🕺💃

December 13, 2024 at 10:16 AM

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

So how did we do it?

For LLMs, we took 4 unique realizations
➕ Reward models
Each judge scored the responses of 64 systems,
which gave us each judge's system ranking

Then we compared each ranking to Arena's gold ranking

December 13, 2024 at 10:16 AM
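
The procedure in this post can be sketched roughly as follows. Two choices here are assumptions, since the thread does not say how they are made: systems are ranked by their mean judge score, and agreement with the gold ranking is measured with Kendall's tau. All system names and scores are hypothetical toy data.

```python
from itertools import combinations
from statistics import mean


def rank_systems(scores_by_system):
    """Order systems from best to worst by their mean judge score."""
    return sorted(
        scores_by_system,
        key=lambda system: mean(scores_by_system[system]),
        reverse=True,
    )


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {system: i for i, system in enumerate(ranking_a)}
    pos_b = {system: i for i, system in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical per-response judge scores for three made-up systems.
judge_scores = {
    "system_a": [0.9, 0.8, 0.7],
    "system_b": [0.4, 0.5, 0.6],
    "system_c": [0.6, 0.7, 0.5],
}
gold_ranking = ["system_a", "system_b", "system_c"]  # hypothetical gold order

judge_ranking = rank_systems(judge_scores)
tau = kendall_tau(judge_ranking, gold_ranking)
```

In this toy case the judge swaps the last two systems, so two of the three pairs agree with the gold order and tau is 1/3. A judge that reproduces the gold ranking exactly would score 1.0.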

asaf-yehudai.bsky.social

@asaf-yehudai.bsky.social

There are many new judge benchmarks,
but most focus on evaluating the judge's ability to choose the better response

We focus on the judge's ability to choose the better system

December 13, 2024 at 10:16 AM
