Aarash Feizi
aarashfeizi.bsky.social
Visiting Researcher at @ServiceNowRSRCH | PhD student in @mcgillu and @Mila_Quebec | Prev. @RecursionPharma

https://aarashfeizi.github.io/
🧵 7/7

📢 Shoutout to my amazing co-authors and to ServiceNow Research and Mila for making this happen! 🚀

📄 Read the full paper: arxiv.org/abs/2502.15210

#PairBench #LLMs #VLMs #GenAI #AutoEval
PairBench: A Systematic Framework for Selecting Reliable Judge VLMs
As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To ad...
February 27, 2025 at 8:03 PM
🧵 6/7

✅ Beyond benchmarking, PairBench can be used during VLM training & fine-tuning to detect biases early and improve evaluation methods!

This could lead to more trustworthy, consistent AI systems for real-world tasks. 🚀
February 27, 2025 at 7:57 PM
🧵 5/7

✅ PairBench correlates strongly with existing benchmarks, meaning it can serve as a low-cost alternative to expensive human-annotated benchmarks!

This makes it easier to compare and rank models efficiently—without excessive computational costs.
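
One way to check this kind of agreement yourself is a rank correlation between two sets of model scores. The sketch below is only an illustration, not the analysis from the paper: the function names are hypothetical, the numbers are made up, and Spearman's rho is an assumed choice of correlation measure.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Made-up scores for three hypothetical models on two benchmarks.
cheap_benchmark = [0.1, 0.5, 0.9]
costly_benchmark = [10.0, 20.0, 30.0]
rho = spearman(cheap_benchmark, costly_benchmark)
```

A rho close to 1 would mean the two benchmarks rank the models in nearly the same order, which is what matters when the goal is model selection.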
February 27, 2025 at 7:54 PM
🧵 4/7

Instead of blindly picking a judge model, we should ask:
🔹 What task is being evaluated?
🔹 What metric matters most?

✅ PairBench helps match the right VLM to the right task, improving fairness & reliability in auto-evaluation.
February 27, 2025 at 7:53 PM
🧵 3/7

🚨 No single VLM is the best! Models vary drastically across PairBench metrics.

Although some align well with human judgements, they may struggle with symmetry, smoothness, or controllability, which makes their scores unreliable!

📄 More failure cases in our paper’s appendix!
February 27, 2025 at 7:52 PM
🧵 2/7

✅ Surprising (and concerning) result: Most VLMs lack symmetry! 🤯

In theory, sim(A, B) = sim(B, A)—but in practice? Many models fail!

For example, simply swapping the order of the input images makes GPT-4o and Gemini 1.5 Pro change their decisions and scores drastically. 🔄
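
A quick way to probe the sim(A, B) = sim(B, A) property on any judge is to score each pair in both orders and measure the gap. This is a minimal sketch under stated assumptions, not PairBench's actual metric: `symmetry_gap` and both toy judges are hypothetical stand-ins, and a real setup would replace the judge with a call to a VLM that returns a numeric score.

```python
from typing import Callable, Sequence, Tuple


def symmetry_gap(
    score: Callable[[str, str], float],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Mean absolute difference between score(a, b) and score(b, a)."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(abs(score(a, b) - score(b, a)) for a, b in pairs) / len(pairs)


def symmetric_judge(a: str, b: str) -> float:
    """Toy judge: Jaccard overlap of the character sets (order-independent)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def order_biased_judge(a: str, b: str) -> float:
    """Toy judge whose score also depends on the length of the FIRST input."""
    return symmetric_judge(a, b) + 0.1 * len(a)


# Hypothetical inputs; strings stand in for images or captions.
pairs = [("cat", "cart"), ("dog", "dig")]
gap_symmetric = symmetry_gap(symmetric_judge, pairs)
gap_biased = symmetry_gap(order_biased_judge, pairs)
```

A gap of zero means the judge is insensitive to input order on these pairs; any positive gap means the score changed when the inputs were swapped.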
February 27, 2025 at 7:52 PM
🧵 1/7

Vision language models (VLMs) are widely used as automated evaluators, but can they actually compare data reliably? 🤔

✅ PairBench systematically tests how well VLMs judge similarity across modalities, revealing key strengths & weaknesses in their decisions.
February 27, 2025 at 7:51 PM