Kwang Moo Yi
@kmyid.bsky.social
Assistant Professor of Computer Science at the University of British Columbia. I also post my daily finds on arxiv.
Bruns et al., "ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training"

Train a scene coordinate regressor with "map codes" (i.e., trainable inputs) so that a single generalizable regressor can serve many scenes. To localize in a new scene, you then optimize only its map code.
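The "optimize only the code, keep the regressor frozen" idea can be sketched in a toy linear setting (everything here — the linear regressor, dimensions, names — is my own stand-in, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_CODE = 8, 4

# Frozen "shared regressor" (a linear toy stand-in for the pre-trained network).
W_feat = rng.normal(size=(3, D_FEAT))
# Orthonormal rows just keep this toy optimization well-conditioned.
W_code = np.linalg.qr(rng.normal(size=(D_CODE, 3)))[0].T

def regress(feat, code):
    """Map (image feature, map code) -> 3D scene coordinate."""
    return W_feat @ feat + W_code @ code

# A "new scene": features with known ground-truth 3D coordinates.
feats = rng.normal(size=(32, D_FEAT))
true_code = rng.normal(size=D_CODE)
targets = np.array([regress(f, true_code) for f in feats])

def scene_error(code):
    return np.mean([np.sum((regress(f, code) - t) ** 2)
                    for f, t in zip(feats, targets)])

# Mapping the new scene touches only the code, never the regressor weights.
code = np.zeros(D_CODE)
init_err = scene_error(code)
lr = 0.1
for _ in range(200):
    grad = np.mean([2 * W_code.T @ (regress(f, code) - t)
                    for f, t in zip(feats, targets)], axis=0)
    code -= lr * grad
final_err = scene_error(code)
```

The appeal is that the expensive network is trained once across scenes; per-scene "mapping" reduces to a small optimization over the code.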
Shrivastava and Mehta et al., "Point Prompting: Counterfactual Tracking with Video Diffusion Models"

Put a red dot where you want to track, then SDEdit the video with a video diffusion model --> zero-shot point tracking. Not as accurate as supervised trackers, but zero-shot!
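The prompting step itself is trivial; the heavy lifting (re-generating the video via SDEdit so the model carries the dot forward) needs an actual video diffusion model. A sketch of just the dot-painting part (function name and defaults are mine):

```python
import numpy as np

def add_prompt_dot(frame, x, y, radius=3):
    """Paint a red dot at (x, y) on an RGB frame of shape (H, W, 3), uint8."""
    H, W, _ = frame.shape
    ys, xs = np.ogrid[:H, :W]
    mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    out = frame.copy()
    out[mask] = [255, 0, 0]  # pure red "counterfactual" marker
    return out
```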
Yuan et al., "LikePhys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference"

I will keep promoting physics benchmark papers for video models until people stop claiming world models :) tl;dr -- Still not there yet.
Xu et al., "ReSplat: Learning Recurrent Gaussian Splats"

Feed-forward Gaussian splatting + learned corrector = fast, high-quality reconstruction. Uses global + kNN attention. Reminds me of PointNet++.
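The kNN-attention half is easy to sketch (a minimal single-head version with no projections; the actual layers surely add projections, heads, and the global branch):

```python
import numpy as np

def knn_attention(x, pos, k):
    """For each point, softmax-attend only over its k nearest neighbors.

    x:   (N, D) per-point features (used as query, key, and value here)
    pos: (N, P) point positions used to pick neighbors
    """
    d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)  # (N, N) distances
    nn = np.argsort(d, axis=1)[:, :k]                         # k nearest (incl. self)
    out = np.zeros_like(x)
    for i in range(len(x)):
        q, K = x[i], x[nn[i]]                                 # (D,), (k, D)
        w = np.exp(K @ q / np.sqrt(x.shape[1]))               # scaled dot-product
        out[i] = (w / w.sum()) @ K                            # softmax-weighted average
    return out
```

Restricting attention to local neighborhoods is what makes the PointNet++ comparison apt: global structure and local refinement are handled by separate mechanisms.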
Xu and Lin et al., "Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers"

Append foundation-model features at the later stages of Marigold-like denoising to get monocular depth. Simple, straightforward idea that works.
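The scheduling idea can be sketched abstractly — inject the semantic features only once denoising reaches the later steps (the concat-style conditioning and the step threshold are my guesses at the mechanism, not the paper's exact design):

```python
import numpy as np

def prompted_denoise(x, denoiser, semantic_feat, n_steps=6, prompt_from=3):
    """Run a toy denoising loop; append semantic features only for step >= prompt_from."""
    for step in range(n_steps):
        cond = semantic_feat if step >= prompt_from else np.zeros_like(semantic_feat)
        x = denoiser(x, cond, step)
    return x

# Dummy denoiser that just records whether it received a nonzero prompt.
log = []
def dummy_denoiser(x, cond, step):
    log.append(bool(cond.any()))
    return x * 0.9  # pretend to denoise

_ = prompted_denoise(np.zeros(4), dummy_denoiser, np.ones(3))
```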
Bamberger and Jones et al., "Carré du champ flow matching: better quality-generalisation tradeoff in generative models"

Geometric regularization of the flow manifold, which boils down to adding anisotropic Gaussian noise during flow-matching training. Neat idea that enhances generalization.
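As I read it, the practical change lands in how training pairs are built. A hedged sketch of standard flow-matching pair construction with anisotropic noise added to the interpolant (the diagonal-covariance noise model and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_pair(x0, x1, t, noise_cov_diag):
    """Return (noisy interpolant, velocity target) for one training example.

    x0: noise sample, x1: data sample, t in [0, 1],
    noise_cov_diag: per-dimension variance of the added anisotropic noise.
    """
    xt = (1.0 - t) * x0 + t * x1                               # linear interpolant
    eps = rng.normal(size=x0.shape) * np.sqrt(noise_cov_diag)  # anisotropic perturbation
    return xt + eps, (x1 - x0)                                 # velocity target unchanged
```

With `noise_cov_diag = 0` this reduces to vanilla flow matching, which makes the method easy to bolt onto an existing pipeline.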
Yugay and Nguyen et al., “Visual Odometry with Transformers”

Instead of point maps, you can also directly output poses. This used to be much less accurate; now it's the opposite. Simple architecture that directly predicts camera embeddings, which then regress rotation and translation.
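A common shape for such a pose head — regress a raw 3x3 matrix and project it onto SO(3) via SVD, plus a linear translation branch — sketched below (whether the paper uses this exact rotation parameterization is an assumption on my part):

```python
import numpy as np

def pose_head(embed, W_rot, W_t):
    """Regress (rotation, translation) from a per-frame camera embedding.

    embed: (D,) camera embedding; W_rot: (9, D); W_t: (3, D).
    """
    M = (W_rot @ embed).reshape(3, 3)          # unconstrained 3x3 prediction
    U, _, Vt = np.linalg.svd(M)
    # Nearest rotation: fix the sign so det(R) = +1.
    R = U @ np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))]) @ Vt
    t = W_t @ embed
    return R, t
```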
Chen et al., "TTT3R: 3D Reconstruction as Test-Time Training"

Cut3R + gated state updates (test-time-training layers) = the speed/efficiency of Cut3R, but with high-quality estimates.
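The gating idea in its simplest recurrent form — a data-dependent convex blend of the old state and the new observation (a minimal stand-in; the real layer learns the gate end-to-end inside the network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_update(state, obs, w_gate):
    """Blend running state with a new observation; gate in (0, 1) from the obs."""
    g = sigmoid(w_gate @ obs)        # scalar gate computed from the observation
    return g * obs + (1.0 - g) * state

state, obs = np.zeros(4), np.ones(4)
open_gate = gated_update(state, obs, np.full(4, 10.0))    # gate ~ 1: take the obs
closed_gate = gated_update(state, obs, np.full(4, -10.0)) # gate ~ 0: keep the state
```

The gate is what lets a fixed-size state avoid being overwritten by every new frame, which is plausibly where the quality gain over a plain recurrent update comes from.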
Two today: Kim et al., "How Diffusion Models Memorize" and Song and Kim et al., "Selective Underfitting in Diffusion Models"

A deep dive into how memorization and generalization happen in diffusion models. Still trying to digest what these mean. Thought-provoking.
Barroso-Laguna et al., "A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features"

When building the context for your feed-forward 3D point-map estimator, don't use full image pairs -- just randomly subsample features! -> fast compute, more images.
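The mechanism is as simple as it sounds (helper name and shapes are mine). Since attention cost is quadratic in total token count, keeping k of N tokens per image cuts that cost by roughly (N/k)^2, which is the budget that gets spent on more images instead:

```python
import numpy as np

def subsample_tokens(feats, k, rng):
    """Randomly keep k of the N per-image feature tokens (no replacement)."""
    idx = rng.choice(len(feats), size=k, replace=False)
    return feats[idx]
```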