esteng.github.io
1⃣ arxiv.org/abs/2410.14596
2⃣ arxiv.org/abs/2503.15272
3⃣ arxiv.org/abs/2409.07394
With awesome collaborators @mohitbansal.bsky.social, @peterbhase.bsky.social, David Wan, @cyjustinchen.bsky.social, Han Wang, @archiki.bsky.social
📆 05/01 2PM: MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
📆 05/02 11AM: AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
Code: github.com/atinpothiraj...
@hf.co Dataset: huggingface.co/datasets/ati...
Paper: arxiv.org/abs/2504.15485
➡️ Providing object coordinates as text improves performance substantially.
➡️ Providing diffusion-based inpainting of the occluded region also helps.
Additionally, model performance depends on pattern type (the shape in which the objects are arranged).
Models generally struggle with multiple aspects of the task (both the occluded and unoccluded settings).
Crucially, every model performs worse in the occluded setting, yet we find that humans perform the task easily even with occlusion.
➡️ CAPTURe-real contains real-world images and tests the ability of models to perform amodal counting in naturalistic contexts.
➡️ CAPTURe-synthetic allows us to analyze specific factors by controlling different variables like color, shape, and number of objects.
This needs pattern recognition + counting, making it a good testbed for VLMs!
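To make the task concrete, here is a minimal hypothetical sketch (the function name and grid assumption are illustrative, not from the paper): when visible objects form a regular grid, amodal counting amounts to recognizing the pattern's dimensions and extrapolating the total, including occluded objects.

```python
# Hypothetical illustration of the reasoning CAPTURe probes:
# if the unoccluded objects reveal a regular rows x cols grid,
# the full (amodal) count can be inferred from the pattern.

def amodal_grid_count(visible_coords, rows, cols):
    """Extrapolate the total count of a rows x cols grid pattern,
    given only the coordinates of unoccluded objects."""
    total = rows * cols                  # pattern recognition -> full grid size
    occluded = total - len(visible_coords)  # counting -> how many are hidden
    return total, occluded

# Example: a 3x4 grid with only 9 objects visible behind an occluder
visible = [(r, c) for r in range(3) for c in range(4)][:9]
total, occluded = amodal_grid_count(visible, 3, 4)  # 12 total, 3 occluded
```

A model that only counts visible objects would answer 9 here; the task requires inferring the pattern to recover the 3 hidden objects.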