Initial seed noise matters. And you can optimize it **without** any backprop through your denoiser via good ol' linearization. Importantly, you need to do this in Fourier space.
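Roughly how I picture the trick -- a toy sketch, with everything (the stand-in `denoiser`, the finite-difference linearization, the low-frequency preconditioner) being my own assumptions, not the paper's actual method:

```python
import numpy as np

def denoiser(z):
    # Stand-in for a frozen black-box sampler mapping seed noise -> image.
    return np.tanh(z) + 0.1 * np.roll(z, 1, axis=-1)

def loss(img, target):
    return float(np.mean((img - target) ** 2))

def optimize_seed(z0, target, steps=30, lr=5.0, eps=1e-4):
    # Linearize the black-box loss around the current seed with finite
    # differences -- no backprop through the denoiser itself.
    z = z0.copy()
    h, w = z.shape
    # Per-frequency preconditioner: favor low-frequency seed updates.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    weight = 1.0 / (1.0 + (fy ** 2 + fx ** 2) * h * w)
    for _ in range(steps):
        base = loss(denoiser(z), target)
        grad = np.zeros_like(z)
        for i in range(h):
            for j in range(w):
                zp = z.copy()
                zp[i, j] += eps
                grad[i, j] = (loss(denoiser(zp), target) - base) / eps
        # Take the step in Fourier space, where the preconditioner lives.
        Z = np.fft.fft2(z) - lr * weight * np.fft.fft2(grad)
        z = np.real(np.fft.ifft2(Z))
    return z
```

The Fourier round-trip only matters because of the per-frequency weighting; without it, the step would be identical in pixel space.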
image (point map) + mask -> Transformer -> pose / voxel -> transformer (image/mask/voxel) -> mesh / splat. Staged training with synthetic and real data & RL. I wish I could see more failure examples to know the limits.
Train a confidence predictor for tokens and merge low-confidence ones for acceleration -> faster reconstruction with VGGT/MapAnything.
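The merging step might look something like this -- a guess on my part (the run-averaging policy and all names are mine, not the paper's):

```python
import numpy as np

def merge_low_confidence(tokens, conf, thresh=0.5):
    # Keep confident tokens as-is; collapse each run of consecutive
    # low-confidence tokens into a single averaged token.
    merged, buf = [], []
    for t, c in zip(tokens, conf):
        if c >= thresh:
            if buf:
                merged.append(np.mean(buf, axis=0))  # flush the low-conf run
                buf = []
            merged.append(t)
        else:
            buf.append(t)
    if buf:
        merged.append(np.mean(buf, axis=0))
    return np.stack(merged)
```

Fewer tokens into the heavy attention layers is where the speedup would come from.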
VGGT-like architecture, but simplified to estimate depth and ray maps (not point maps). Uses teacher-student training of Depth Anything v2.
Make a rough warp, push it through an Image-to-Video model, co-denoising with the warp up until a chosen timestep, then let the model finish the rest without interference.
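In spirit it's an SDEdit-style schedule; here's a toy sketch where a smoothing operator stands in for the frozen sampler step (the blending weights and everything else are my assumptions, not the paper's recipe):

```python
import numpy as np

def refine_step(x, step, total):
    # Stand-in for one sampler step of a frozen image-to-video model:
    # a smoothing operator whose strength decays over the schedule.
    blur = (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0
    return x + (blur - x) * (1.0 - step / total)

def guided_sample(x_init, warp, total=20, guide_until=8):
    # Co-denoise with the rough warp for the first `guide_until` steps,
    # then let the model finish on its own.
    x = x_init.copy()
    for step in range(total):
        x = refine_step(x, step, total)
        if step < guide_until:
            w = 1.0 - step / guide_until  # guidance decays to zero
            x = (1 - w) * x + w * warp
    return x
```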
I like simple ideas -- this one says you should consider multiple views when you prune/clone, which allows fewer Gaussians to be used for training.
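One plausible reading of "consider multiple views" -- my guess at the criterion, not the paper's exact rule: only densify a Gaussian when several views agree its screen-space gradient is large, rather than trusting a single averaged view.

```python
import numpy as np

def densify_mask(view_grads, grad_thresh=0.01, min_views=3):
    # view_grads: (num_views, num_gaussians) screen-space gradient magnitudes.
    # Clone/split only where at least `min_views` views vote "high gradient".
    votes = (np.abs(view_grads) > grad_thresh).sum(axis=0)
    return votes >= min_views
```

Requiring cross-view agreement is exactly what would keep the Gaussian count down.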
Extract Dynamic 3D Gaussians for an object -> Vision Language Models to extract physics parameters -> model a force field (wind). Leads to some fun.
VGGT extended to dynamic scenes with a dynamic mask predictor.
Tracking, waaaaaay back in the day, used to be solved with sampling methods. They are now back. Also reminds me of my first major conference work, where I looked into how much impact the initial target point has.
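For the youngsters: the classic sampling-based tracker is the bootstrap particle filter. A minimal 1D version (generic textbook recipe, not this paper's method):

```python
import numpy as np

def particle_filter(observations, n=500, motion_std=0.5, obs_std=1.0, seed=0):
    # Bootstrap particle filter: predict with a random-walk motion model,
    # weight particles by the Gaussian observation likelihood, resample.
    rng = np.random.default_rng(seed)
    particles = rng.normal(observations[0], 1.0, n)
    estimates = []
    for z in observations:
        particles += rng.normal(0.0, motion_std, n)          # predict
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)  # weight
        w /= w.sum()
        idx = rng.choice(n, size=n, p=w)                     # resample
        particles = particles[idx]
        estimates.append(particles.mean())
    return np.array(estimates)
```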
Make your RoPE encoding 3D by including a z axis, then manipulate your image by simply manipulating your positional encoding in 3D --> novel view synthesis. Neat idea.
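The 3D part is easy to sketch: split the feature dim into three chunks and rotate each by one of the (x, y, z) coordinates. This is my own minimal version (chunking scheme and names assumed, not taken from the paper); "manipulating the image" then amounts to re-encoding with shifted coordinates.

```python
import numpy as np

def rope_3d(x, pos, base=100.0):
    # x: (..., d) features with d divisible by 6; pos: (..., 3) coordinates.
    # Each third of the feature dim gets rotary encoding from one axis.
    d = x.shape[-1]
    chunk = d // 3
    half = chunk // 2
    out = x.copy()
    for axis in range(3):
        s = axis * chunk
        freqs = base ** (-np.arange(half) / half)
        ang = pos[..., axis:axis + 1] * freqs          # (..., half)
        a = x[..., s:s + half]
        b = x[..., s + half:s + chunk]
        out[..., s:s + half] = a * np.cos(ang) - b * np.sin(ang)
        out[..., s + half:s + chunk] = a * np.sin(ang) + b * np.cos(ang)
    return out
```

The usual RoPE property carries over: attention scores depend only on relative 3D offsets, which is what makes shifting the positional encoding act like moving the camera.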