https://aarashfeizi.github.io/
📢 Shoutout to my amazing co-authors and to ServiceNow Research and Mila for making this happen! 🚀
📄 Read the full paper: arxiv.org/abs/2502.15210
#PairBench #LLMs #VLMs #GenAI #AutoEval
✅ Beyond benchmarking, PairBench can be used during VLM training & fine-tuning to detect biases early and improve evaluation methods!
This could lead to more trustworthy, consistent AI systems for real-world tasks. 🚀
✅ PairBench correlates strongly with existing benchmarks, meaning it can serve as a low-cost alternative to expensive human-annotated benchmarks!
This makes it easier to compare and rank models efficiently—without excessive computational costs.
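To make "compare and rank models" concrete, here is a minimal sketch (the numbers are made-up placeholders, not results from the paper) of checking whether PairBench scores order models the same way a human-annotated benchmark does, via Spearman rank correlation:

```python
# Minimal sketch: rank-correlate PairBench scores with an existing benchmark.
# The values below are illustrative placeholders, NOT numbers from the paper.
from scipy.stats import spearmanr

pairbench_scores = [0.71, 0.64, 0.58, 0.52]  # one score per model
benchmark_scores = [0.83, 0.79, 0.70, 0.66]  # same models, same order

rho, p_value = spearmanr(pairbench_scores, benchmark_scores)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```

A high rank correlation means PairBench preserves the model ordering you would get from the more expensive human-annotated benchmark.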
Instead of blindly picking a judge model, we should ask:
🔹 What task is being evaluated?
🔹 What metric matters most?
✅ PairBench helps match the right VLM to the right task, improving fairness & reliability in auto-evaluation.
🚨 No single VLM is the best! Models vary drastically across PairBench metrics.
Although some align well with human judgements, they may struggle with symmetry, smoothness, or controllability, making their scores unreliable!
📄 More failure cases in our paper’s appendix!
✅ Surprising (and concerning) result: Most VLMs lack symmetry! 🤯
In theory, sim(A, B) = sim(B, A)—but in practice? Many models fail!
For example, simply swapping the order of the input images makes GPT-4o and Gemini 1.5 Pro change their decisions and scores drastically. 🔄
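As a rough illustration of what a symmetry check looks like in code (a minimal sketch, not the PairBench harness; `judge_similarity` is a hypothetical stand-in for whatever VLM API you use), you can query the judge twice with the image order swapped and measure the gap:

```python
# Minimal sketch of a symmetry probe for a VLM judge.
# judge_similarity is a hypothetical placeholder, NOT the PairBench implementation.

def judge_similarity(image_a: str, image_b: str) -> float:
    """Placeholder: ask your VLM judge for a similarity score of (image_a, image_b) in [0, 1]."""
    raise NotImplementedError("Wire this up to the VLM API you want to test.")

def symmetry_gap(image_a: str, image_b: str) -> float:
    """|sim(A, B) - sim(B, A)|; 0 means the judge treats both input orders the same."""
    return abs(judge_similarity(image_a, image_b) - judge_similarity(image_b, image_a))
```

A symmetric judge keeps this gap near zero; the failures described above show up as large gaps once the input order is flipped.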
Vision language models (VLMs) are widely used as automated evaluators, but can they actually compare data reliably? 🤔
✅ PairBench systematically tests how well VLMs judge similarity across modalities, revealing key strengths & weaknesses in their decisions.