Lightnews — Scholar-powered news

Aarash Feizi

@aarashfeizi.bsky.social

Visiting Researcher at @ServiceNowRSRCH | PhD student in @mcgillu and @Mila_Quebec | Prev. @RecursionPharma

https://aarashfeizi.github.io/

🧵 7/7

📢 Shoutout to my amazing co-authors and to ServiceNow Research and Mila for making this happen! 🚀

📄 Read the full paper: arxiv.org/abs/2502.15210

#PairBench #LLMs #VLMs #GenAI #AutoEval

PairBench: A Systematic Framework for Selecting Reliable Judge VLMs

As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To ad...

arxiv.org

February 27, 2025 at 8:03 PM

🧵 6/7

✅ Beyond benchmarking, PairBench can be used during VLM training & fine-tuning to detect biases early and improve evaluation methods!

This could lead to more trustworthy, consistent AI systems for real-world tasks. 🚀

February 27, 2025 at 7:57 PM

🧵 5/7

✅ PairBench correlates strongly with existing benchmarks, meaning it can serve as a low-cost alternative to expensive human-annotated benchmarks!

This makes it easier to compare and rank models efficiently—without excessive computational costs.

February 27, 2025 at 7:54 PM

🧵 4/7

Instead of blindly picking a judge model, we should ask:
🔹 What task is being evaluated?
🔹 What metric matters most?

✅ PairBench helps match the right VLM to the right task, improving fairness & reliability in auto-evaluation.

February 27, 2025 at 7:53 PM

🧵 3/7

🚨 No single VLM is the best! Models vary drastically across PairBench metrics.

Although some align well with human judgements, they may struggle with symmetry, smoothness, or controllability, which makes their scores unreliable!

📄 More failure cases in our paper’s appendix!

February 27, 2025 at 7:52 PM

🧵 2/7

✅ Surprising (and concerning) result: Most VLMs lack symmetry! 🤯

In theory, sim(A, B) = sim(B, A)—but in practice? Many models fail!

For example, simply swapping the order of the input images makes GPT-4o and Gemini 1.5 Pro change their decision and scores drastically. 🔄
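
To make the symmetry idea concrete, here is a minimal sketch of an order-swap check. It is not PairBench's actual metric or code: `symmetry_gap` and both toy judges are hypothetical names made up for illustration, and a real judge would be a VLM call returning a similarity score.

```python
from typing import Callable, Iterable, Tuple


def symmetry_gap(
    score: Callable[[str, str], float],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Mean absolute difference between score(a, b) and score(b, a).

    A perfectly symmetric judge yields 0.0; larger values mean the judge's
    score depends on the order in which the two inputs are presented.
    """
    gaps = [abs(score(a, b) - score(b, a)) for a, b in pairs]
    if not gaps:
        raise ValueError("pairs must not be empty")
    return sum(gaps) / len(gaps)


# Hypothetical stand-ins for a judge model (assumptions, not real VLMs).
def symmetric_judge(a: str, b: str) -> float:
    # Same score regardless of argument order.
    return 1.0 if a == b else 0.5


def order_biased_judge(a: str, b: str) -> float:
    # Score depends on which input comes first, mimicking an order bias.
    return 0.75 if a < b else 0.25


pairs = [("cat", "dog"), ("sun", "moon")]
gap_symmetric = symmetry_gap(symmetric_judge, pairs)
gap_biased = symmetry_gap(order_biased_judge, pairs)
```

With a real judge, the same check would run each pair through the model twice, once per order, and compare the two scores.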

February 27, 2025 at 7:52 PM

🧵 1/7

Vision language models (VLMs) are widely used as automated evaluators, but can they actually compare data reliably? 🤔

✅ PairBench systematically tests how well VLMs judge similarity across modalities, revealing key strengths & weaknesses in their decisions.

February 27, 2025 at 7:51 PM
