merve (@merve.bsky.social)
proud mediterranean 🧿 open-sourceress at hugging face 🤗 multimodality, zero-shot vision, vision language models, transformers
llama.cpp has vision language model support now! ❤️‍🔥

get started with sota VLMs (gemma 3, Qwen2.5VL, InternVL3 & more) and serve them wherever you want 🤩
learn more github.com/ggml-org/lla... 📖
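llama-server exposes an OpenAI-compatible endpoint, so any HTTP client works; a minimal Python sketch, assuming a vision model and its multimodal projector are already loaded and serving on localhost:8080:

```python
# query a VLM served by llama.cpp's llama-server (OpenAI-compatible API)
import base64
import requests

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```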
May 11, 2025 at 7:46 AM
If you want to ✨ speed up & harden ✨ your RAG pipelines, use visual document retrieval models ⬇️

We have shipped a how-to guide for VDR models in Hugging Face transformers 🤗📖 huggingface.co/docs/transfo...
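a minimal sketch of the kind of flow the guide covers, using ColPali through transformers (checkpoint id and output fields as in the docs, but double-check the guide for the exact API):

```python
# embed document pages and queries, then score with late interaction
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_id = "vidore/colpali-v1.2-hf"
model = ColPaliForRetrieval.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = ColPaliProcessor.from_pretrained(model_id)

pages = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["what was the Q3 revenue?"]

with torch.no_grad():
    page_embeds = model(**processor(images=pages, return_tensors="pt")).embeddings
    query_embeds = model(**processor(text=queries, return_tensors="pt")).embeddings

# late-interaction (MaxSim) scores: rows = queries, columns = pages
scores = processor.score_retrieval(query_embeds, page_embeds)
print(scores.argmax(dim=-1))  # best-matching page per query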
May 2, 2025 at 9:49 AM
Why do people sleep on DSE multimodal retrieval models? 👀

They're just like ColPali, but highly scalable and fast, and you can make them even more efficient with binarization or matryoshka embeddings with little degradation 🪆⚡️

I collected some here huggingface.co/collections/...
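both tricks are a few lines once you have the embeddings; a minimal sketch with stand-in vectors:

```python
# shrink DSE-style single-vector embeddings two ways
import numpy as np

emb = np.random.randn(1000, 1536).astype(np.float32)  # stand-in page embeddings

# matryoshka: keep only the first k dimensions, then re-normalize
k = 512
small = emb[:, :k]
small /= np.linalg.norm(small, axis=1, keepdims=True)

# binarization: 1 bit per dimension, packed to bytes (32x smaller than fp32)
binary = np.packbits((emb > 0).astype(np.uint8), axis=1)

# Hamming distance on the packed bits is a cheap similarity proxy
query = binary[0]
dists = np.unpackbits(binary ^ query, axis=1).sum(axis=1)
print(dists.argsort()[:5])  # top-5 nearest pages
```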
April 15, 2025 at 4:26 PM
I'm so hooked on @hf.co Inference Providers (specifically Qwen2.5-VL-72B) for multimodal agentic workflows with smolagents 🥹

get started ⤵️
> filter models by inference provider
> test them through the widget or via Python/JS/cURL
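a minimal sketch of that flow with smolagents (the provider argument and passing images to run() are assumptions that vary across smolagents versions, check the docs):

```python
# a vision-capable code agent backed by an Inference Providers model
from PIL import Image
from smolagents import CodeAgent, HfApiModel

model = HfApiModel(model_id="Qwen/Qwen2.5-VL-72B-Instruct", provider="hyperbolic")
agent = CodeAgent(tools=[], model=model, add_base_tools=True)

# multimodal tasks take images alongside the text instruction
agent.run("What does this chart show?", images=[Image.open("chart.png")])
```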
April 15, 2025 at 2:59 PM
my weekly summary of what's been released in open AI is up on @hf.co huggingface.co/posts/merve/...

collection is here huggingface.co/collections/...
April 14, 2025 at 12:24 PM
fan-favorite open-source OCR model for PDFs, OlmOCR, gets faster and more efficient ⚡️

RolmOCR-7B follows the same recipe as OlmOCR, building on Qwen2.5VL with training set modifications, and improves accuracy & performance 🤝

huggingface.co/reducto/Rolm...
April 14, 2025 at 8:51 AM
the model also has impressive OCR capabilities ⬇️
April 11, 2025 at 7:10 PM
we'll give this model a test on agentic capabilities, but here's an example from the paper:
April 11, 2025 at 7:09 PM
This model consists of a MoonViT encoder with dynamic resolution handling, a projection layer and a 16B MoE decoder (with 2.8B active params)

the paper introduces an interesting pre-training pipeline to handle long context, and the model saw 4.4T tokens arxiv.org/pdf/2504.07491
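a minimal loading sketch, assuming the Hub checkpoint ships its own modeling code (hence trust_remote_code; exact class names may differ, see the model card):

```python
# load Kimi-VL with transformers via the repo's custom modeling code
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# only ~2.8B of the 16B MoE params are active per token at inference
```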
April 11, 2025 at 7:08 PM
DO NOT SLEEP ON THIS MODEL

Kimi-VL-A3B-Thinking is the first capable open-source reasoning VLM with an MIT license ❤️
> it has only 2.8B activated params 👏
> it's agentic 🔥 works on GUIs
> surpasses gpt-4o

I've put it to the test (see below ⤵️) huggingface.co/spaces/moons...
April 11, 2025 at 7:08 PM
InternVL3 is out 💥

> 7 ckpts with various sizes (1B to 78B)
> Built on InternViT encoder and Qwen2.5VL decoder, improves on Qwen2.5VL
> Can do reasoning, document tasks, extending to tool use and agentic capabilities 🤖
> easily use with Hugging Face transformers 🤗 huggingface.co/collections/...
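a minimal usage sketch with the image-text-to-text API (the -hf checkpoint id here is an assumption, check the collection for the exact repo names):

```python
# run InternVL3 through transformers' image-text-to-text interface
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/doc.png"},
    {"type": "text", "text": "Summarize this document."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```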
April 11, 2025 at 1:35 PM
All the multimodal document retrieval models (ColPali, DSE et al) are now under visual document retrieval at @hf.co 📝🤗

take your favorite VDR model out for multimodal RAG 🤝
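you can also browse the new task programmatically; a minimal sketch, assuming the tag string is "visual-document-retrieval":

```python
# list the most-downloaded visual document retrieval models on the Hub
from huggingface_hub import list_models

for m in list_models(pipeline_tag="visual-document-retrieval",
                     sort="downloads", limit=10):
    print(m.id)
```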
February 26, 2025 at 11:39 AM
Everything that was released this past week in open AI 🤠

> Link to all models, datasets, demos huggingface.co/collections/...
> Text-readable version is here huggingface.co/posts/merve/...
January 17, 2025 at 3:28 PM
there's a new multimodal retrieval model in town 🤠
@llamaindex.bsky.social released vdr-2b-multi-v1
> uses 70% fewer image tokens, yet outperforms other dse-qwen2-based models
> 3x faster inference with less VRAM 💨
> shrinkable with matryoshka 🪆
huggingface.co/collections/...
January 13, 2025 at 11:11 AM
What a week to open the year in open ML, all the things released at @hf.co 🤠

Here's everything released, find text-readable version here huggingface.co/posts/merve/...

All models are here huggingface.co/collections/...
January 10, 2025 at 2:51 PM
ViTPose, the best open-source pose estimation model, just landed in @hf.co transformers 🕺🏻💃🏻

🔖 Model collection: huggingface.co/collections/...

🔖 Notebook on how to use: colab.research.google.com/drive/1e8fcb...

🔖 Try it here: huggingface.co/spaces/hysts...
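a minimal sketch of the top-down flow (ViTPose needs person boxes from any detector first; here one box is hard-coded as an assumption):

```python
# top-down pose estimation: boxes in, 17 COCO keypoints out
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, VitPoseForPoseEstimation

model_id = "usyd-community/vitpose-base-simple"
processor = AutoProcessor.from_pretrained(model_id)
model = VitPoseForPoseEstimation.from_pretrained(model_id)

image = Image.open("dancer.jpg")
boxes = np.array([[50, 30, 200, 400]])  # one person, COCO [x, y, w, h]

inputs = processor(image, boxes=[boxes], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

poses = processor.post_process_pose_estimation(outputs, boxes=[boxes])
print(poses[0][0]["keypoints"])  # keypoints for the first person
```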
January 9, 2025 at 2:27 PM

The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates these to feed into the LLM 💬

the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ⤵️
January 9, 2025 at 12:00 PM
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗

The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos ⏯️
January 9, 2025 at 12:00 PM
see the blog and our docs for more insights into native agentic skills of LLMs and getting started with smolagents, courtesy of the amazing
@m--ric.bsky.social

> Blog: hf.co/blog/smolage...
> Quickstart: huggingface.co/docs/smolage...
December 31, 2024 at 3:39 PM
you can still do traditional tool calling, where the model calls tools with JSON

writing a tool and using it is very easy, just decorate the function with `@tool`

what's cooler is that you can push and pull tools from Hugging Face Hub! see below
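a minimal sketch of a custom tool (the type hints and the Args docstring are what smolagents uses to build the tool schema; the Hub repo id is illustrative):

```python
# define a tool, push it to the Hub, and pull it back
from smolagents import load_tool, tool

@tool
def get_weather(city: str) -> str:
    """Returns a (fake) weather report for a city.

    Args:
        city: name of the city to look up.
    """
    return f"It is sunny in {city} today."

get_weather.push_to_hub("merve/get-weather-tool")  # illustrative repo id
weather_tool = load_tool("merve/get-weather-tool", trust_remote_code=True)
```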
December 31, 2024 at 3:39 PM
It is very easy to use CodeAgent!

Just initialize it with the tools and the model of your choice

See below how you can get started; you can use models through the HF Inference API as well as locally!
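a minimal sketch to get going (HfApiModel() picks a default model on the HF Inference API; swap in a local model class like TransformersModel to run on your own hardware):

```python
# a CodeAgent with the built-in toolbox and a hosted model
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run("How many seconds are there in a leap year?")
```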
December 31, 2024 at 3:39 PM
smolagents is a barebones library to unlock both native and traditional tool calling for language models

LLMs can already write code and do reasoning, so why bother yourself with writing the tool?

CodeAgent class is here for it! see it in action below
December 31, 2024 at 3:39 PM
supercharge your LLM apps with smolagents 🔥

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by @hf.co to make the LLM write code, do analysis and automate boring stuff! huggingface.co/blog/smolage...
December 31, 2024 at 3:32 PM
ColPali has landed in @hf.co transformers and I have just shipped a very lean fine-tuning tutorial in smol-vision 🤠💗

QLoRA fine-tuning in 4-bit with a batch size of 4 fits in 32 GB VRAM and is very fast! ✨
github.com/merveenoyan/...
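a minimal sketch of the QLoRA setup in the tutorial's spirit (target modules and hyperparameters here are assumptions, see the notebook for the real config):

```python
# 4-bit quantized base model + LoRA adapters for ColPali fine-tuning
import torch
from transformers import BitsAndBytesConfig, ColPaliForRetrieval
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = ColPaliForRetrieval.from_pretrained(
    "vidore/colpali-v1.2-hf", quantization_config=bnb_config
)

lora_config = LoraConfig(
    r=8, lora_alpha=8, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters train
```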
December 20, 2024 at 3:53 PM
you can now easily stay up-to-date with big AI research labs' updates on @hf.co via the org activity page 🥹

I have been looking forward to this feature, as back-to-back releases can feel overwhelming and I tend to miss out 🤠
December 20, 2024 at 1:09 PM