sta8is.bsky.social
@sta8is.bsky.social
🚀 The architecture demonstrates significant performance improvements with extended training—indicating substantial potential for future enhancements (8/n)
February 26, 2025 at 7:57 PM
💡 Our multimodal approach significantly outperforms single-modality variants, demonstrating the power of learning cross-modal relationships (7/n)
February 26, 2025 at 7:57 PM
📈 Results are impressive! We achieve state-of-the-art performance in future semantic segmentation on Cityscapes, with strong improvements in both short-term (0.18s) and mid-term (0.54s) predictions (6/n)
February 26, 2025 at 7:57 PM
🎭 Key innovation #3: We developed a novel multimodal masked visual modeling objective specifically designed for future prediction tasks (5/n)
February 26, 2025 at 7:57 PM
🔗 Key innovation #2: Our model features an efficient cross-modality fusion mechanism that improves predictions by learning synergies between different modalities (segmentation + depth) (4/n)
February 26, 2025 at 7:57 PM
🎯 Key innovation #1: We introduce a VAE-free hierarchical tokenization process integrated directly into our transformer. This simplifies training, reduces computational overhead, and enables true end-to-end optimization (3/n)
February 26, 2025 at 7:57 PM
🔍 FUTURIST employs a multimodal visual sequence transformer to directly predict multiple future semantic modalities. We focus on two key modalities: semantic segmentation and depth estimation—critical capabilities for autonomous systems operating in dynamic environments (2/n)
February 26, 2025 at 7:57 PM
🧵 Excited to share our latest work: FUTURIST - A unified transformer architecture for multimodal semantic future prediction, is accepted to #CVPR2025! Here's how it works (1/n)
👇 Links to the arxiv and github below
February 26, 2025 at 7:57 PM
7/n 🔬Interesting discovery: The intermediate features from our transformer can actually enhance the already-strong VFM features, suggesting potential for self-supervised learning.
February 7, 2025 at 5:06 PM
6/n 📊And it works amazingly well! We achieve state-of-the-art results in semantic segmentation forecasting, with strong performance across multiple tasks using a single feature prediction model.
February 7, 2025 at 5:06 PM
5/n 🎨The beauty of our method? It's completely modular - different task-specific heads (segmentation, depth estimation, surface normals) can be plugged in without retraining the core model.
February 7, 2025 at 5:06 PM
4/n 🔄Our approach: We train a masked feature transformer to predict how VFM features change over time. These predicted features can then be used for various scene understanding tasks!
February 7, 2025 at 5:06 PM
1/n 🚀 Excited to share our latest work: DINO-Foresight, a new framework for predicting the future states of scenes using Vision Foundation Model features!
Links to the arXiv and Github 👇
February 7, 2025 at 5:06 PM