Hyperplane
@hyperplane.bsky.social
Your weekly read. From POC to Production, at scale.

🫵 Follow our substack: https://thehyperplane.substack.com/
👀 Our Ebook: https://hyperplane.gumroad.com/l/fine-tuning-stt-on-edge
Still better than no boat at all!

In all realness, code generation is a great assistant for an already great programmer 🤷
March 26, 2025 at 2:08 PM
We show more in the upcoming eBook, free for all subscribers: mlvanguards.substack.com
ML Vanguards | Substack
Escaping PoC purgatory: Your Weekly Guide to production paradise. Click to read ML Vanguards, a Substack publication.
mlvanguards.substack.com
March 26, 2025 at 10:55 AM
- Normalize & clean transcripts (remove garbage text, repeated words, weird artifacts)
- Filter out the junk
- Split (70/15/15) & push to @hf.co for easy access during training
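
A minimal sketch of the three steps above, using only the standard library. The cleaning rules (bracketed tags, repeated words), the `min_words` cutoff, and every function name here are illustrative assumptions, not the pipeline from the eBook.

```python
import random
import re


def clean_transcript(text: str) -> str:
    """Normalize a transcript: drop bracketed tags, squeeze whitespace, collapse repeats."""
    text = re.sub(r"\[[^\]]*\]", " ", text)           # assumed artifact style: "[music]", "[inaudible]"
    text = re.sub(r"\s+", " ", text).strip().lower()  # single spaces, lowercase
    # Collapse immediately repeated words: "the the the cat" -> "the cat"
    return re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)


def is_usable(text: str, min_words: int = 3) -> bool:
    """Hypothetical junk filter: keep only transcripts with enough words."""
    return len(text.split()) >= min_words


def split_dataset(rows, seed: int = 42, percents=(70, 15, 15)):
    """Shuffle deterministically, then cut into train/validation/test by percentage."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * percents[0] // 100
    n_val = len(rows) * percents[1] // 100
    return {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],  # remainder, so no row is dropped
    }
```

Pushing the splits to the Hub is left out of the sketch; one option is the `push_to_hub` method on a `DatasetDict` from the `datasets` library.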

2/2
March 26, 2025 at 10:55 AM
It's kinda free for all newsletter subscribers: mlvanguards.substack.com
March 24, 2025 at 8:43 PM
Data is boring
But broken search results are worse
mlvanguards.substack.com
March 24, 2025 at 12:18 PM
6. Vector Database

A vector database like Vespa stores embeddings and enables similarity searches. It also uses metadata to improve relevance by associating vectors with key attributes like document type, page number, or detected visual features.
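
To make the idea concrete, here is a toy in-memory sketch of similarity search with a metadata filter. It is not Vespa's API; `Record`, `cosine`, and `search` are hypothetical names, and a real vector database would use an approximate index rather than a full scan.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    doc_id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)  # e.g. doc_type, page


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query, top_k: int = 3, **filters):
    """Keep records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query, r.vector), reverse=True)
    return ranked[:top_k]
```

Calling `search(records, query, doc_type="invoice")` restricts the ranking to invoice vectors, which is the role metadata plays in the post.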

7/7
March 24, 2025 at 12:18 PM
5. Chunking Strategy

Splits documents into manageable chunks for embedding:

- Layout-based chunking is for visual embeddings.

- Text-density and structure-based chunking is for traditional embeddings. This preserves context without overloading the vector database.
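
A rough sketch of the text side only: pack whole paragraphs into chunks up to a size limit, so context stays together and the chunk count stays manageable. The `max_chars` limit and the blank-line paragraph rule are assumptions; layout-based chunking for visual embeddings is not covered here.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs (split on blank lines) into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate          # paragraph still fits in the open chunk
        else:
            if current:
                chunks.append(current)   # close the open chunk
            current = para               # an oversized paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```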

6/7
March 24, 2025 at 12:18 PM
4. Embedding Models

Embedding models convert document content into vectors.

- Traditional embeddings for documents with clean text extracted via OCR.

- Vision Language Models (VLM) handle multimodal documents with complex visual structures like tables, charts, and diagrams.

5/7
March 24, 2025 at 12:18 PM
3. Decisional Algorithm

The algorithm is centralized: it takes the embedding decider's classification and routes each document to one processing path.

- Text-heavy documents are processed with OCR and text embedding models.

- Documents with complex layouts use vision language models (e.g., ColPali) instead, skipping OCR.
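
The routing itself can be sketched as a single function. The `"text"` and `"visual"` labels and the `Route` names are assumptions for illustration; the post does not specify what the decider actually emits.

```python
from enum import Enum


class Route(Enum):
    OCR_TEXT_EMBEDDING = "OCR, then a text embedding model"
    VISION_EMBEDDING = "vision language model, no OCR"


def choose_route(decider_label: str) -> Route:
    """Map the decider's label (assumed to be 'text' or 'visual') to a processing path."""
    if decider_label == "text":
        return Route.OCR_TEXT_EMBEDDING
    if decider_label == "visual":
        return Route.VISION_EMBEDDING
    raise ValueError(f"unknown decider label: {decider_label!r}")
```

Keeping the decision in one place means a new document type only needs a new branch here, not changes across the pipeline.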

4/7
March 24, 2025 at 12:18 PM
2. PDF Embedding Decider

This decider analyzes the document's structure, using tools like a layout analyzer, visual element detector, or text density analyzer, to classify whether a traditional text embedding or a multimodal vision embedding is appropriate.
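
One way such a decider might look, as a sketch: combine a text density signal with a count of pages that contain visual elements. `PageStats`, both thresholds, and the `"text"`/`"visual"` labels are hypothetical; a real decider would get these signals from a layout analyzer or visual element detector.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    char_count: int   # characters of extractable text on the page
    image_count: int  # detected images or charts
    table_count: int  # detected tables


def classify_document(pages, min_chars_per_page: int = 800,
                      max_visual_ratio: float = 0.3) -> str:
    """Return 'text' for dense, mostly plain pages and 'visual' otherwise."""
    if not pages:
        raise ValueError("document has no pages")
    avg_chars = sum(p.char_count for p in pages) / len(pages)
    visual_pages = sum(1 for p in pages if p.image_count or p.table_count)
    visual_ratio = visual_pages / len(pages)
    if avg_chars >= min_chars_per_page and visual_ratio <= max_visual_ratio:
        return "text"
    return "visual"
```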

3/7
March 24, 2025 at 12:18 PM
1. PDF Reader

The starting point of any pipeline is the PDF reader. Its job is to extract pages and pass them downstream. A high-quality reader ensures no lost information, whether the content is text-heavy, image-dense, or filled with tables and graphs.

2/7
March 24, 2025 at 12:18 PM