1. Divide the image into smaller crops.
2. Extract text segments capturing objects, attributes and relations.
3. Use the VLM to find image crops that best fit the text segments.
4. Aggregate matching similarities for the final score.
1. Divide the image into smaller crops.
2. Extract text segments capturing objects, attributes and relations.
3. Use the VLM to find image crops that best fit the text segments.
4. Aggregate matching similarities for the final score.
Can a simple inference-time approach unlock better Vision-Language Compositionality?🤯
Our latest paper shows how adding structure at inference significantly boosts performance in popular dual-encoder VLMs on different datasets.
Read more: arxiv.org/abs/2506.09691
Can a simple inference-time approach unlock better Vision-Language Compositionality?🤯
Our latest paper shows how adding structure at inference significantly boosts performance in popular dual-encoder VLMs on different datasets.
Read more: arxiv.org/abs/2506.09691