Cem Koç
cemkoch.bsky.social
Coffee Lover • Husky Dad • ML Researcher @  • Berkeley Grad
Huge thanks to the amazing people:
@pavankumarvasu.bsky.social, Fartash Faghri, Chun-Liang Li, Hadi Pouransari, @onceltuzel.bsky.social, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Christopher Webb
May 7, 2025 at 10:26 PM
For more, check out our paper on arXiv: arxiv.org/abs/2412.13303

With the amazing people: @pavankumarvasu.bsky.social, Fartash Faghri, Chun-Liang Li, Hadi Pouransari, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, and @onceltuzel.bsky.social
FastVLM: Efficient Vision Encoding for Vision Language Models
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders su...
arxiv.org
December 19, 2024 at 7:22 PM
What is exciting is that the FastVLM model family (VLMs with a FastViTHD vision backbone) scales very well with more SFT data, which is vital, and achieves SOTA performance while being significantly faster 🚀
December 19, 2024 at 7:10 PM
We ran multiple experiments comparing different input resolutions (256, 512, 768, 1024) and LLM sizes (0.5B, 1.5B, 7B) to find the optimal setup. FastViTHD's Pareto-optimal curve shows significant gains over FastViT (which is already better than ViTs) 👇
December 19, 2024 at 6:58 PM
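To make the "Pareto-optimal curve" concrete, here is a toy sketch of how you'd pick the non-dominated (latency, accuracy) configs from a resolution/LLM-size sweep. All config names and numbers below are hypothetical placeholders, not the paper's measurements:

```python
# Hypothetical (name, latency_ms, accuracy) triples for a few
# resolution/LLM-size configs -- illustrative numbers only.
configs = [
    ("256/0.5B", 20.0, 55.0),
    ("512/0.5B", 45.0, 60.0),
    ("512/1.5B", 80.0, 66.0),
    ("768/1.5B", 140.0, 65.0),  # slower AND less accurate than 512/1.5B
    ("1024/7B", 400.0, 72.0),
]

def pareto_front(points):
    """Keep configs that no other config strictly dominates
    (i.e., nothing else is at least as fast AND at least as
    accurate, with at least one strict improvement)."""
    front = []
    for name, lat, acc in points:
        dominated = any(
            l <= lat and a >= acc and (l < lat or a > acc)
            for _, l, a in points
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(configs))
```

The dominated 768/1.5B point drops out; the remaining configs trace the latency-accuracy trade-off curve shown in the plot.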
Text-rich tasks require high image resolutions, which increase both the vision encoding latency and the number of image tokens, which in turn raises the LLM pre-filling time. Therefore, instead of using an isotropic architecture, we use a hybrid vision backbone that can scale to higher input resolutions.
December 19, 2024 at 6:50 PM
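To see why token count blows up with resolution, here is a sketch of standard ViT patchification (a 14-pixel patch size is assumed for illustration; FastViTHD's actual downsampling differs):

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Token count for a square image under plain ViT
    patchification: one token per non-overlapping patch.
    Patch size 14 is an illustrative assumption."""
    return (resolution // patch_size) ** 2

# Token count grows roughly quadratically with resolution:
for res in (256, 512, 768, 1024):
    print(res, num_image_tokens(res))
```

Going from 256 to 1024 pixels multiplies the token count by roughly 16x, which is exactly the pressure on pre-filling time the post describes.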
We measure time-to-first-token (TTFT), the wait time until the first token response from the VLM. It combines the vision encoder latency and the LLM pre-filling time (the time it takes the LLM to fill the KV-cache and output its first token). At high resolutions, the vision encoder latency dominates.
December 19, 2024 at 6:42 PM
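The TTFT decomposition above can be sketched as a simple additive model. The linear per-token prefill cost and every timing below are illustrative assumptions, not measured FastVLM numbers:

```python
def ttft_ms(vision_latency_ms: float, num_image_tokens: int,
            prefill_ms_per_token: float = 0.03) -> float:
    """TTFT = vision encoder latency + LLM pre-filling time.
    The linear prefill model and the default per-token cost
    are illustrative assumptions."""
    prefill_ms = num_image_tokens * prefill_ms_per_token
    return vision_latency_ms + prefill_ms

# Hypothetical low- vs high-resolution settings:
low = ttft_ms(vision_latency_ms=8.0, num_image_tokens=324)      # ~256px input
high = ttft_ms(vision_latency_ms=250.0, num_image_tokens=5329)  # ~1024px input
print(low, high)
```

In the high-resolution setting the 250 ms encoder term outweighs the prefill term, matching the observation that the vision encoder dominates TTFT at high resolutions.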
FastVLM incorporates FastViTHD, a novel hybrid vision encoder backbone designed to output fewer image tokens and significantly reduce the encoding time for high resolution images.
December 19, 2024 at 6:34 PM