Be warned, I did fire up the meme generator for my slides...
Full script at huggingface.co/datasets/uv-...
Currently processing @natlibscot.bsky.social's 27,915-page handbook collection with one command.
Processing at ~350 images/sec on A100
Using @hf.co Jobs + uv - zero setup batch OCR!
Will share final time + cost when done!
With @wjbmattingly.bsky.social I'm launching small-models-for-glam on @hf.co to create/curate models that run on modest hardware and address GLAM use cases.
Follow the org to keep up-to-date!
huggingface.co/small-models...
Nanonets just released OCR2 - a 3B parameter vision-language model for document OCR 📄
You can run it with one command on @hf.co Jobs (no local GPU needed)
I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs
Tested on 1800s library cards - works great ✨
I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.
Libraries are starting to explore AI-assisted cataloguing, but we lack public evaluation data. Hoping this helps fill that gap.
huggingface.co/datasets/big...
iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)
Trained with @hf.co TRL + Jobs - single UV script, no GPU needed!
Blog soon!
It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs.
See the full OCR uv scripts collection: huggingface.co/datasets/uv-...
NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.
Could be pretty valuable for weird historical documents?
Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...
How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social?
RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!)
@hf.co Demo: huggingface.co/spaces/davan...
I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social
huggingface.co/spaces/davanstrien/ocr-time-capsule
One-command VLM-based OCR with uv Scripts:
hf jobs uv run [script] ufo-images ufo-text
Classified UFO docs → clean markdown. Zero setup!
Try it → huggingface.co/datasets/uv-...
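For anyone curious what sits behind a one-liner like that: a uv script is a single Python file with its requirements declared inline (PEP 723), so uv can build the environment on the fly. Here is a minimal sketch of the shape, assuming a CLI that takes an input dataset ID and an output dataset ID like the command above. The function names and the `--image-column` flag are hypothetical, and the OCR step is a stub, not the real script.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///
# A real OCR script would list its packages in `dependencies` above.
import argparse


def parse_args(argv=None):
    """Parse the two positional dataset IDs (hypothetical CLI, for illustration)."""
    parser = argparse.ArgumentParser(description="Batch OCR sketch")
    parser.add_argument("input_dataset", help="dataset ID to read images from")
    parser.add_argument("output_dataset", help="dataset ID to write markdown to")
    parser.add_argument("--image-column", default="image", help="hypothetical flag")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Stub: a real script would load the images, run the model on them,
    # and push a new dataset containing the extracted markdown.
    return f"{args.input_dataset} -> {args.output_dataset} (column: {args.image_column})"


# Demo with the dataset IDs from the post; a real script would call main()
# with no arguments so it reads them from the command line.
print(main(["ufo-images", "ufo-text"]))
```

The inline metadata block is what makes "zero setup" possible: the script carries its own requirements, so there is no separate environment to install first.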
FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages.
Huge thanks to all who contributed!
huggingface.co/blog/davanst...
But are they actually better than traditional OCR engines, which output XML for historical docs?
I built OCR Time Machine to test it!
📄 Upload image + ALTO/PAGE XML
⚖️ Compare outputs side by side
🔗 huggingface.co/spaces/davan...
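If you want to try this kind of comparison yourself, here is a rough sketch of one way to do it, assuming ALTO input only (PAGE XML uses different elements and is not handled). It pulls the words out of an ALTO file and scores them against another transcription. This is not how the Space is implemented; the sample XML, the function names, and the choice of `difflib` as the similarity measure are all my own illustration.

```python
import difflib
import xml.etree.ElementTree as ET

# Invented two-line ALTO sample, for illustration only.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="THEATRE"/><SP/><String CONTENT="ROYAL"/>
    </TextLine>
    <TextLine>
      <String CONTENT="This"/><SP/><String CONTENT="Evening"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the XML namespace so different ALTO versions match."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Join the CONTENT of each String element, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0 (difflib ratio)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


traditional = alto_to_text(ALTO_SAMPLE)
vlm_output = "THEATRE ROYAL\nThis Evening"  # stand-in for a model's transcription
print(traditional)
print(similarity(traditional, vlm_output))
```

A similarity score only says how far the two outputs disagree, not which one is right, so side-by-side reading against the image still matters.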
how should we critically engage with AI?
Can you guess how I answered the question below?!
Not a perfect fix, but making ML-ready datasets from collections can help.
If you want help getting your data on @hf.co, I'd be happy to help.
The serious point of this one is that the barrier to doing data work has gotten much lower in the past year or two. You don't need to be an expert in XML to do useful stuff with XML data anymore.
Features AI-powered search, parameter analysis via safetensors, and tools to find similar models/datasets.
Try: "Find non maths reasoning datasets from 2025"!
huggingface.co/deepseek-ai/...
- 3M+ visual elements from historic US newspapers — photos, maps, cartoons, OCR + metadata.
- Parquet = fast filters, easier analysis.
- Great for ML + cultural research.
👉 huggingface.co/datasets/big...
- 3.5K annotated historical newspaper pages
- Bounding boxes + category labels
- Photos, ads, headlines, cartoons & more