Julien Gaubil
@jgaubil.bsky.social
PhD student at École Polytechnique
Interested in Computer Vision, Geometry, and learning both at the same time

https://www.jgaubil.com/
ah yes, I see!

We definitely tried to see whether the operations implemented by the layers followed known algorithms. A least-squares-based optimisation like in your paper was a good candidate, given how often Procrustes problems show up in 3D vision - but alas we couldn't identify one
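For anyone following along, here's a minimal sketch of the closed-form SVD solution to that kind of Procrustes problem (the names and shapes are illustrative, not from our codebase):

```python
# Sketch only: classic orthogonal Procrustes / Kabsch alignment, i.e. the
# least-squares rotation between two 3xN point sets. Illustrative, not our code.
import torch

def procrustes_rotation(X, Y):
    """Rotation R minimizing ||R @ Xc - Yc||_F for centered copies of X and Y (both 3xN)."""
    Xc = X - X.mean(dim=1, keepdim=True)        # center both point sets
    Yc = Y - Y.mean(dim=1, keepdim=True)
    U, _, Vt = torch.linalg.svd(Yc @ Xc.T)      # SVD of the 3x3 cross-covariance
    d = torch.sign(torch.linalg.det(U @ Vt)).item()
    D = torch.diag(torch.tensor([1.0, 1.0, d], dtype=X.dtype, device=X.device))
    return U @ D @ Vt                            # enforce det(R) = +1
```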
November 4, 2025 at 10:00 PM
Thanks for sharing!

Is this internal iterative refinement a known phenomenon in 3D networks, or are you referring to a specific architecture?
November 4, 2025 at 7:43 PM
This was a cool project done jointly with the great Michal Stary, under the amazing supervision of @ayusht.bsky.social and @vincentsitzmann.bsky.social at MIT! [8/8]
November 4, 2025 at 7:40 PM
We presented this at the End-to-End 3D Learning Workshop at ICCV 2025, and hope it inspires more work on understanding large reconstruction models!

We’re working on a clean version of the code, and we’ll release it once yours truly is done with the CVPR deadline [7/8]
November 4, 2025 at 7:40 PM
We also find that the decoder turns 𝐬𝐞𝐦𝐚𝐧𝐭𝐢𝐜 correspondences into 𝐠𝐞𝐨𝐦𝐞𝐭𝐫𝐢𝐜 𝐜𝐨𝐫𝐫𝐞𝐬𝐩𝐨𝐧𝐝𝐞𝐧𝐜𝐞𝐬.

We identified attention heads specialized in finding correspondences across views.

We can clearly see the geometric refinement on this difficult image pair by visualizing their cross-attention maps! [6/8]
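If you want to poke at this yourself, a rough sketch of how one can capture such cross-attention maps with forward hooks (module paths like dec_blocks / cross_attn and the attention-weight layout are assumptions, not DUSt3R's exact API):

```python
# Sketch: capture cross-attention weights with forward hooks, then look at where
# a query token from view 1 attends in view 2. Attribute names are hypothetical.
import torch
import matplotlib.pyplot as plt

attn_maps = {}

def save_attn(name):
    def hook(module, inputs, output):
        # Assumes the attention module exposes its last weights as
        # (batch, heads, queries, keys); adapt to the real implementation.
        attn_maps[name] = module.last_attn_weights.detach().cpu()
    return hook

def register_cross_attn_hooks(decoder):
    for i, block in enumerate(decoder.dec_blocks):       # hypothetical module path
        block.cross_attn.register_forward_hook(save_attn(f"block{i}"))

def show_query_attention(name, head, query_idx, patches_hw):
    """Visualize where one query token of view 1 attends over view 2's token grid."""
    h, w = patches_hw
    attn = attn_maps[name][0, head, query_idx]           # (num_keys,)
    plt.imshow(attn.reshape(h, w))
    plt.title(f"{name} / head {head}")
    plt.show()
```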
November 4, 2025 at 7:40 PM
Surprisingly, 𝐚𝐥𝐦𝐨𝐬𝐭 𝐚𝐥𝐥 𝐨𝐟 𝐭𝐡𝐞 𝐢𝐦𝐩𝐫𝐨𝐯𝐞𝐦𝐞𝐧𝐭 𝐢𝐬 𝐝𝐮𝐞 𝐭𝐨 𝐬𝐞𝐥𝐟-𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐥𝐚𝐲𝐞𝐫𝐬!

Nevertheless, this doesn’t mean cross-attention layers are useless - without them, there would be no communication between views.

This instead suggests that cross- and self-attention layers play very different roles [5/8]
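For the curious, the bookkeeping behind this claim boils down to something like the sketch below (the layer types and error values are placeholders, not our measurements):

```python
# Sketch: attribute the probed pointmap-error drop to each sub-layer type.
from collections import defaultdict

def improvement_by_type(error_before_first, errors_after, layer_types):
    """errors_after[i]: probe error after sub-layer i; layer_types[i] in
    {"cross_attn", "self_attn", "mlp"}, following the decoder's block order."""
    prev, totals = error_before_first, defaultdict(float)
    for err, kind in zip(errors_after, layer_types):
        totals[kind] += prev - err      # positive = this sub-layer reduced the error
        prev = err
    return dict(totals)
```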
November 4, 2025 at 7:40 PM
Can we dive deeper into the network? Yes!

We can observe the impact of each layer on the iterative reconstruction process by comparing the pointmap error before and after the layer.

Here, we plot the error difference for every layer of DUSt3R’s second-view decoder [4/8]
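Concretely, the comparison is as simple as the sketch below (assuming the probed pointmap after each layer and a ground-truth pointmap are available; names are illustrative):

```python
# Sketch: per-layer change in pointmap error, from the probed reconstructions.
import torch
import matplotlib.pyplot as plt

def pointmap_error(pred, gt):
    """Mean Euclidean distance between two (H, W, 3) pointmaps."""
    return (pred - gt).norm(dim=-1).mean()

def layerwise_error_delta(probed_pointmaps, gt):
    """probed_pointmaps[l] = pointmap decoded by the probe from layer l's tokens."""
    errors = torch.stack([pointmap_error(p, gt) for p in probed_pointmaps])
    return errors[1:] - errors[:-1]     # negative = that layer improved the estimate

# deltas = layerwise_error_delta(probed_pointmaps, gt_pointmap)
# plt.bar(range(1, len(deltas) + 1), deltas.numpy())
# plt.xlabel("decoder layer"); plt.ylabel("error change")
```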
November 4, 2025 at 7:40 PM
We observe that 𝐫𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐢𝐬 𝐚𝐧 𝐢𝐭𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐩𝐫𝐨𝐜𝐞𝐬𝐬, with decoder blocks progressively refining the pointmaps.

For easy image pairs, a good estimate of the relative position emerges early in the decoder, whereas harder pairs require more decoder blocks, sometimes even failing to converge [3/8]
November 4, 2025 at 7:40 PM
To open up DUSt3R, we train individual MLP probes on intermediate layers of an early checkpoint, using the same pointmap objective.

We can then analyze its inference through the sequence of reconstructions - see below! [2/8]
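Roughly, each probe looks like the sketch below (shapes and names are assumptions, and the plain L2 loss is only a stand-in for the actual pointmap objective):

```python
# Sketch of one MLP probe: per-token decoder features -> a 3D point per pixel.
import torch
import torch.nn as nn

class PointmapProbe(nn.Module):
    def __init__(self, dim, patch_size=16, hidden=512):
        super().__init__()
        self.p = patch_size
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, 3 * patch_size * patch_size),
        )

    def forward(self, tokens, patches_hw):
        h, w = patches_hw                                 # token grid of the view
        B, p = tokens.shape[0], self.p
        pts = self.mlp(tokens).view(B, h, w, p, p, 3)     # one patch of 3D points per token
        return pts.permute(0, 1, 3, 2, 4, 5).reshape(B, h * p, w * p, 3)

# probe = PointmapProbe(dim=768)
# pred = probe(frozen_intermediate_tokens, (H // 16, W // 16))   # (B, H, W, 3)
# loss = (pred - gt_pointmap).norm(dim=-1).mean()
```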
November 4, 2025 at 7:40 PM
Where would understanding surface geometry (as in distances, curvatures, and so on) fit in this diagram?

I’d say it implies multi-view consistency of the geometry, so I would add an arrow on the left of your chart. Do you agree, and if so, don’t you think we should start there?
August 13, 2025 at 11:51 AM