Dimitrije Antić
anticdimi.bsky.social
CV & ML Ph.D. student at @uva.nl | prev. Univ. of Tuebingen, MPI-IS | Teaching machines to perceive humans. | anticdimi.github.io
Reposted by Dimitrije Antić
To bridge this 2D-to-3D gap, we propose "Render-Localize-Lift":
- Render: 3D human/object meshes into multiview 2D images.
- Localize: A Multiview Localization (MV-Loc) model, guided by VLM tokens, predicts 2D contact masks.
- Lift: 2D contact masks to 3D.
(5/10)
June 15, 2025 at 12:23 PM
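The three steps above can be sketched end to end. This is a minimal toy illustration, not the InteractVLM implementation: `localize_stub` stands in for the MV-Loc model, the orthographic projection and the majority-vote lifting rule are simplifying assumptions, and all names here are hypothetical.

```python
import numpy as np

def render_views(verts, n_views=4, res=32):
    """Render: orthographically project 3D mesh vertices into n_views
    2D pixel grids; returns per-view (u, v) pixel coords per vertex."""
    coords = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views
        # Rotate the mesh about the vertical axis to the k-th viewpoint.
        R = np.array([[np.cos(theta), 0, np.sin(theta)],
                      [0, 1, 0],
                      [-np.sin(theta), 0, np.cos(theta)]])
        p = verts @ R.T
        xy = p[:, :2]  # drop depth: orthographic projection
        uv = ((xy - xy.min(0)) / (np.ptp(xy, 0) + 1e-8) * (res - 1)).astype(int)
        coords.append(uv)
    return coords

def localize_stub(uv, res=32):
    """Localize: stand-in for the MV-Loc model. Here it just marks the
    lower half of each view as 'in contact' (e.g. feet on the ground)."""
    mask = np.zeros((res, res), dtype=bool)
    mask[res // 2:, :] = True
    return mask

def lift_to_3d(verts, coords, masks):
    """Lift: a vertex is labeled as 3D contact if its 2D projection
    falls inside the predicted contact mask in a majority of views."""
    votes = np.zeros(len(verts))
    for uv, mask in zip(coords, masks):
        votes += mask[uv[:, 1], uv[:, 0]]
    return votes >= len(masks) / 2

verts = np.random.default_rng(0).normal(size=(200, 3))  # toy "mesh"
coords = render_views(verts)
masks = [localize_stub(uv) for uv in coords]
contact = lift_to_3d(verts, coords, masks)  # per-vertex boolean labels
```

The voting in `lift_to_3d` is one plausible way to aggregate per-view masks; any scheme that projects 2D mask pixels back onto visible vertices would fit the same Render-Localize-Lift structure.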
Reposted by Dimitrije Antić
How can we infer 3D contact with limited 3D data? InteractVLM exploits foundation models: a VLM & a localization model, fine-tuned to reason about contact. Given an image & a prompt, the VLM outputs tokens that condition the localization model. But these models work in 2D, while contact is 3D. (4/10)
June 15, 2025 at 12:23 PM
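The interface described above, where the VLM emits a token that conditions a 2D localization model, can be sketched with stubs. Everything here is a hypothetical stand-in (the feature extraction, the token dimension, and the decoder are all invented for illustration), not the actual fine-tuned models.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_stub(image, prompt, dim=16):
    """Stand-in for the fine-tuned VLM: maps an (image, prompt) pair to
    a single localization token (a feature vector). Here: a random
    projection of crude image statistics and a hashed prompt."""
    img_feat = np.array([image.mean(), image.std()])
    txt_feat = (abs(hash(prompt)) % 97) / 97.0
    feats = np.concatenate([img_feat, [txt_feat]])
    W = rng.normal(size=(dim, feats.size))
    return W @ feats  # the token that conditions the mask decoder

def mask_decoder_stub(token, image):
    """Stand-in for the 2D localization model: scores each pixel
    conditioned on the token, then thresholds into a binary mask."""
    weight = token @ rng.normal(size=token.size)  # token -> scalar gate
    score = image * weight
    return score > score.mean()

image = rng.random((32, 32))  # toy "image"
token = vlm_stub(image, "Where does the person touch the chair?")
mask = mask_decoder_stub(token, image)  # 2D contact mask
```

The point of the sketch is the data flow: image + prompt → token → 2D mask, which is exactly the gap the Render-Localize-Lift scheme then closes by lifting such masks to 3D.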