Lightnews — Scholar-powered news

sta8is.bsky.social

@sta8is.bsky.social

🚀 The architecture demonstrates significant performance improvements with extended training—indicating substantial potential for future enhancements (8/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

💡 Our multimodal approach significantly outperforms single-modality variants, demonstrating the power of learning cross-modal relationships (7/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

📈 Results are impressive! We achieve state-of-the-art performance in future semantic segmentation on Cityscapes, with strong improvements in both short-term (0.18s) and mid-term (0.54s) predictions (6/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

🎭 Key innovation #3: We developed a novel multimodal masked visual modeling objective specifically designed for future prediction tasks (5/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

🔗 Key innovation #2: Our model features an efficient cross-modality fusion mechanism that improves predictions by learning synergies between different modalities (segmentation + depth) (4/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

🎯 Key innovation #1: We introduce a VAE-free hierarchical tokenization process integrated directly into our transformer. This simplifies training, reduces computational overhead, and enables true end-to-end optimization (3/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

🔍 FUTURIST employs a multimodal visual sequence transformer to directly predict multiple future semantic modalities. We focus on two key modalities: semantic segmentation and depth estimation—critical capabilities for autonomous systems operating in dynamic environments (2/n)

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

🧵 Excited to share our latest work: FUTURIST, a unified transformer architecture for multimodal semantic future prediction, has been accepted to #CVPR2025! Here's how it works (1/n)
👇 Links to the arXiv and GitHub below

February 26, 2025 at 7:57 PM

sta8is.bsky.social

@sta8is.bsky.social

7/n 🔬 Interesting discovery: The intermediate features from our transformer can actually enhance the already-strong VFM features, suggesting potential for self-supervised learning.

February 7, 2025 at 5:06 PM

sta8is.bsky.social

@sta8is.bsky.social

6/n 📊 And it works amazingly well! We achieve state-of-the-art results in semantic segmentation forecasting, with strong performance across multiple tasks using a single feature prediction model.

February 7, 2025 at 5:06 PM

sta8is.bsky.social

@sta8is.bsky.social

5/n 🎨 The beauty of our method? It's completely modular: different task-specific heads (segmentation, depth estimation, surface normals) can be plugged in without retraining the core model.

February 7, 2025 at 5:06 PM
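
The modularity described in this post can be illustrated with a small sketch in plain Python. Everything here is hypothetical and chosen for illustration: the names `predict_future_features`, `register_head`, and `forecast`, the linear-extrapolation stand-in for the core model, and the toy heads. It is not the authors' implementation; it only shows how heads might share one feature predictor that is never modified.

```python
from typing import Callable, Dict, List

Features = List[float]  # toy stand-in for a predicted feature vector

def predict_future_features(context: List[Features]) -> Features:
    """Stand-in for the trained core model (assumption): linear
    extrapolation from the last two context frames."""
    prev, last = context[-2], context[-1]
    return [2 * l - p for p, l in zip(prev, last)]

# Registry of task heads; adding a head never touches the predictor.
HEADS: Dict[str, Callable[[Features], object]] = {}

def register_head(name: str):
    def wrap(fn):
        HEADS[name] = fn
        return fn
    return wrap

@register_head("segmentation")
def seg_head(feats: Features) -> int:
    # Toy head: index of the largest feature, read as a class label.
    return max(range(len(feats)), key=feats.__getitem__)

@register_head("depth")
def depth_head(feats: Features) -> float:
    # Toy head: mean feature value, read as a depth estimate.
    return sum(feats) / len(feats)

def forecast(context: List[Features], task: str):
    """Predict future features once, then apply the requested head."""
    return HEADS[task](predict_future_features(context))

context = [[0.0, 1.0, 0.0], [0.0, 2.0, 1.0]]
seg = forecast(context, "segmentation")
depth = forecast(context, "depth")
```

A further head, for example one for surface normals, would be one more `register_head` call in this sketch, with the predictor left unchanged.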

sta8is.bsky.social

@sta8is.bsky.social

4/n 🔄 Our approach: We train a masked feature transformer to predict how VFM features change over time. These predicted features can then be used for various scene understanding tasks!

February 7, 2025 at 5:06 PM
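
A minimal sketch of the masked-feature idea in this post, in plain Python with toy numbers. The helper names, the choice of mean squared error, and the `copy_last_predictor` placeholder (used where a transformer would go) are all assumptions for illustration, not details taken from the thread.

```python
from typing import List, Optional

Frame = List[float]  # toy stand-in for one frame's VFM feature vector

def mask_future(frames: List[Frame], n_context: int) -> List[Optional[Frame]]:
    """Keep the first n_context frames visible; hide the rest (None = masked)."""
    return [f if i < n_context else None for i, f in enumerate(frames)]

def copy_last_predictor(masked: List[Optional[Frame]]) -> List[Frame]:
    """Placeholder for the transformer (assumption): fill each masked
    frame with a copy of the most recent visible frame."""
    if not masked or masked[0] is None:
        raise ValueError("at least one visible context frame is required")
    out: List[Frame] = []
    last: Frame = masked[0]
    for f in masked:
        if f is not None:
            last = f
        out.append(list(last))
    return out

def masked_mse(pred: List[Frame], target: List[Frame],
               masked: List[Optional[Frame]]) -> float:
    """Mean squared error over masked positions only."""
    errs = [
        (p - t) ** 2
        for p_frame, t_frame, m in zip(pred, target, masked)
        if m is None
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(errs) / len(errs)

frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # features drifting over time
masked = mask_future(frames, n_context=2)        # last frame is hidden
pred = copy_last_predictor(masked)
loss = masked_mse(pred, frames, masked)
```

In this sketch the loss is taken only on the hidden future frame, so a predictor is rewarded for modeling how features change rather than for reproducing the visible context.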

sta8is.bsky.social

@sta8is.bsky.social

1/n 🚀 Excited to share our latest work: DINO-Foresight, a new framework for predicting the future states of scenes using Vision Foundation Model features!
Links to the arXiv and GitHub 👇

February 7, 2025 at 5:06 PM
