Lightnews — Scholar-powered news

Tensorlake

@tensorlake.ai

Document parsing benchmarks have been measuring the wrong thing.

We tested every major parser on real enterprise documents.

The results will change how you think about OCR accuracy 🧵

Two dense document pages flank a skeptical person’s sticker-style portrait against a green gradient, link text centered below.

November 5, 2025 at 5:05 PM

Tensorlake

@tensorlake.ai

Want to build scalable data lakes w/ Tensorlake + @qdrant.bsky.social?

In the free Qdrant Essentials Course, learn how to:
- Architect vector-powered data lakes
- Optimize ETL pipelines
- Create knowledge graphs
- Integrate @langchain.bsky.social agents for natural language queries

t.co/OoPZswrL7z

October 23, 2025 at 7:37 PM

Tensorlake

@tensorlake.ai

Most parsers strip all tracked changes when you extract the text.

That means:
❌ Lost audit trails
❌ Manual review of revision history
❌ No programmatic access to reviewer comments
❌ Workflows that can't route based on specific edits

Tensorlake interface showing parsed Word document with tracked changes preserved as HTML tags, displaying an insurance claim report
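
A minimal sketch of what programmatic access to preserved revisions could look like. It assumes insertions and deletions arrive as the standard HTML `<ins>` and `<del>` elements; the post only says "HTML tags", so the tag names, the `RevisionCollector` class, and `extract_revisions` are illustrative assumptions rather than Tensorlake's documented output.

```python
from html.parser import HTMLParser


class RevisionCollector(HTMLParser):
    """Collects text found inside <ins>/<del> elements (assumed tag names)."""

    def __init__(self):
        super().__init__()
        self.revisions = []
        self._open = []  # stack of currently open <ins>/<del> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("ins", "del"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Only text inside a revision element is recorded.
        if self._open and data.strip():
            kind = "insertion" if self._open[-1] == "ins" else "deletion"
            self.revisions.append({"kind": kind, "text": data.strip()})


def extract_revisions(html: str) -> list[dict]:
    parser = RevisionCollector()
    parser.feed(html)
    parser.close()
    return parser.revisions
```

A workflow could then route on the result, for example by sending any document whose revisions touch a dollar amount to a reviewer.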

October 10, 2025 at 5:25 PM

Tensorlake

@tensorlake.ai

OCR engines constantly mess up document hierarchy.

Section 2.2 becomes a top-level header (##) instead of nested (###).

We just shipped automatic header correction.

🧵 How it works:
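
The rest of the thread is not reproduced here. As a rough illustration only, and not Tensorlake's implementation, one way to repair heading depth is to derive it from the dotted section number: `2.2` has one dot, so it nests one level below `2`. The function name `fix_header_levels` and the `top_level` parameter are assumptions.

```python
import re

# Matches a markdown heading that starts with a dotted section number,
# e.g. "## 2.2 Sampling" or "# 3.1.4 Results".
HEADING = re.compile(r"^(#+)\s+(\d+(?:\.\d+)*)\s+(.*)$")


def fix_header_levels(markdown: str, top_level: int = 2) -> str:
    """Re-derive heading depth from the section number (assumed convention)."""
    fixed = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _, number, title = match.groups()
            # "2" stays at top_level, "2.2" goes one deeper, and so on.
            depth = top_level + number.count(".")
            line = f"{'#' * depth} {number} {title}"
        fixed.append(line)
    return "\n".join(fixed)
```

Headings without a section number are left untouched, so this heuristic only helps on numbered documents.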

October 2, 2025 at 4:21 PM

Tensorlake

@tensorlake.ai

Step 3: Generating AI responses with verifiable citations

Once your chunks carry anchors, retrieval doesn’t change. You can use the dense, hybrid, or reranker setup you already have. Consider hiding the anchors in the displayed prose while keeping them in the model output and making the IDs clickable.

RAG citation workflow diagram on dark green background showing document processing pipeline: Document (PDF/Image) → Tensorlake Document AI → Parsed Elements (Text, Tables, Figures, and Bounding Box) → merge and insert anchors → Chunks and Anchors (Clean text and citation IDs) → splits to Citation Metadata (page, bounding box, citation IDs) and Vector DB (embeddings, text, and metadata). URL: https://tlake.link/blog/rag-citations
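
A minimal sketch of the display step, assuming anchors look like `<c>2.1</c>` and that citation metadata is a dict keyed by anchor ID, as in the chunking example from this thread. The function name `render_citations` and the numbered-marker format are assumptions.

```python
import re

ANCHOR = re.compile(r"<c>(.*?)</c>")


def render_citations(answer: str, metadata: dict) -> tuple[str, list[dict]]:
    """Swap raw anchors for numbered markers and collect their locations."""
    order: list[str] = []  # anchor IDs in order of first appearance

    def to_marker(match: re.Match) -> str:
        cid = match.group(1)
        if cid not in order:
            order.append(cid)
        return f"[{order.index(cid) + 1}]"

    text = ANCHOR.sub(to_marker, answer)
    citations = [
        {"marker": i + 1, "id": cid, **metadata.get(cid, {})}
        for i, cid in enumerate(order)
    ]
    return text, citations
```

A front end can turn each marker into a link that opens the cited page and highlights the bounding box.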

September 19, 2025 at 5:44 PM

Tensorlake

@tensorlake.ai

Step 2: Create contextualized chunks

Iterate through page fragment objects and create appropriately sized chunks by combining them. As you create the chunks, you can create contextualized metadata to help during retrieval.

Before and after comparison of document chunking on dark green background. Top panel "Without Contextualized Chunking" shows plain text: "SMOTE creates a broader decision region for the minority class...". Bottom panel "With Contextualized Chunking" shows same text with citation anchor "<c>2.1</c>" and metadata: {"2.1": {"page": 23, "bbox": {...}}}. URL: https://tlake.link/blog/rag-citations
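
The chunking step can be sketched as follows. This assumes each fragment is a dict with `content`, `page`, and `bbox` keys and that anchor IDs follow a `<chunk number>.<position>` pattern; both are guesses at a workable shape, and `build_chunks` is a hypothetical helper.

```python
def build_chunks(fragments: list[dict], max_chars: int = 800) -> list[dict]:
    """Merge fragments into chunks, tagging each fragment with an anchor."""
    chunks: list[dict] = []
    parts: list[str] = []             # anchored fragment texts for the open chunk
    citations: dict[str, dict] = {}   # anchor ID -> page and bounding box

    def flush() -> None:
        if parts:
            chunks.append({"text": " ".join(parts), "citations": dict(citations)})
            parts.clear()
            citations.clear()

    for frag in fragments:
        current_len = sum(len(p) for p in parts)
        # Start a new chunk when adding this fragment would exceed the budget.
        if parts and current_len + len(frag["content"]) > max_chars:
            flush()
        anchor_id = f"{len(chunks) + 1}.{len(parts) + 1}"
        parts.append(f"{frag['content']} <c>{anchor_id}</c>")
        citations[anchor_id] = {"page": frag["page"], "bbox": frag["bbox"]}
    flush()
    return chunks
```

The chunk text is what gets embedded, while the `citations` dict travels alongside it as metadata.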

September 19, 2025 at 5:44 PM

Tensorlake

@tensorlake.ai

Step 1: Parse docs with bounding boxes

Using our Document AI API, you get a full document layout. For each page fragment you have access to the page number, fragment type, content, and bounding box, which makes it easy to add metadata and anchor points to chunks before embedding.

Tensorlake Document AI interface showing document layout analysis with JSON output on left displaying fragment types, content, and bounding box coordinates, and PDF preview on right with highlighted text regions and yellow bounding boxes overlaid on research paper content
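
A minimal sketch of flattening such a layout into per-fragment records. The response shape used here (`pages`, `page_number`, `fragments`, `type`, `text`, and a `bbox` dict with `x1`, `y1`, `x2`, `y2`) is an assumption for illustration; check the actual API response before relying on any of these key names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    page: int
    kind: str
    content: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2)


def flatten_layout(layout: dict) -> list[Fragment]:
    """Turn a nested page layout (hypothetical shape) into a flat list."""
    fragments = []
    for page in layout["pages"]:
        for frag in page["fragments"]:
            box = frag["bbox"]
            fragments.append(
                Fragment(
                    page=page["page_number"],
                    kind=frag["type"],
                    content=frag["text"],
                    bbox=(box["x1"], box["y1"], box["x2"], box["y2"]),
                )
            )
    return fragments
```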

September 19, 2025 at 5:44 PM

Tensorlake

@tensorlake.ai

In finance, clinical trials, or performance benchmarks, dense tables contain mission-critical data.

But flatten that data, as most parsers do, and trust is lost.

Tensorlake restores trust by preserving structure, generating summaries for effective embeddings, and attaching evidence via b-boxes.

September 11, 2025 at 8:00 PM

Tensorlake

@tensorlake.ai

You can now log in to Tensorlake using Microsoft and Azure SSO credentials.

This is the beginning of better integration between Microsoft Azure and Tensorlake.

If you are using Azure and need better document ingestion and ETL for unstructured data, reach out to us!

September 5, 2025 at 5:49 PM

Tensorlake

@tensorlake.ai

To build trustworthy AI, your data needs proof.

Get citations for every field extracted with Tensorlake.

Read the blog and try our citations with the example notebooks: tlake.link/blog/citations

September 5, 2025 at 4:30 PM

Tensorlake

@tensorlake.ai

Step 4: Test your context-aware agents

This is the fun part: use Tensorlake to extract key claims from news articles, then use your @langchain.bsky.social agent to query ChromaDB and determine whether the claims are rooted in fact.

Code snippet styled in a terminal window with green background. The Python code defines a Pydantic model NewsArticle with two fields: article_key_points as a list of strings describing key points of the article, and article_summary as a string summarizing the article. It then defines structured_extraction_options with a StructuredExtractionOptions object using the NewsArticle schema. Finally, it calls doc_ai.parse_and_wait with file=article and the structured extraction options, assigning the result to article_result.

Code-styled summary window with green background. Three claims about Tesla filings are listed: Claim: Record deliveries/deployments in Q4 2024 — Supported by Filings? YES — Notes: Clearly stated in filings. Claim: Deliveries/deployments indicate quarterly financials/profits — Supported by Filings? NO — Notes: Explicitly contradicted by filings. Claim: Tesla’s profits or net income figures for Q4 2024 — Supported by Filings? NO — Notes: Not yet released; filings only preview data.

August 21, 2025 at 2:59 AM

Tensorlake

@tensorlake.ai

Step 3: Contextualize Queries

Don't rely on users to write queries that are specific to your data. Instead, contextualize each query so that hybrid search finds the most relevant and accurate chunks.

In our example, we used @langchain.bsky.social to help.

Code snippet styled in a terminal window with green background. The Python function create_query(state: State) uses llm.invoke to generate a natural language query for searching a vector database of Tesla SEC filings. It prints the query content and returns a dictionary containing a messages list with an AIMessage holding the query, along with the query string itself.
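
A library-free sketch of the same idea: wrap the user's question in a prompt that describes the corpus, then ask a model for one specific search query. The model is passed in as a plain callable so the example runs without any SDK; `contextualize_query` and the prompt wording are assumptions, not the code from the screenshot.

```python
from typing import Callable


def contextualize_query(
    user_query: str,
    corpus_description: str,
    llm: Callable[[str], str],
) -> str:
    """Rewrite a vague question as a corpus-specific search query."""
    prompt = (
        f"You are searching {corpus_description}.\n"
        "Rewrite the user's question as one specific search query. "
        "Name the entities, document types, and time periods it implies.\n"
        f"Question: {user_query}\n"
        "Search query:"
    )
    # The callable stands in for whatever chat model the agent uses.
    return llm(prompt).strip()
```

In an agent graph, the returned string would be stored in the state and passed to the retriever in the next node.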

August 21, 2025 at 2:59 AM

Tensorlake

@tensorlake.ai

Step 2: Chunk, Store, and Retrieve Data

With clean, structured, and accurate data, you can chunk and embed your documents so that agents can retrieve them effectively and accurately.

In our example, we used @chonkieai.bsky.social and ChromaDB.

Code snippet styled in a terminal window with green background. The Python dictionary chunk_data is being built, containing keys such as id, pdf_url, chunk_index, text, start_index, end_index, and metadata. The metadata includes nested fields like source_type set to 'tesla_sec_filing', pdf_url, chunk_id, total_chunks, chunk_index, filing_date pulled from structured_data, key_points, and page_classifications.
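
A sketch of how records with the fields named in the screenshot might be assembled. The input shapes are assumptions: chunks are plain dicts with `text`, `start_index`, and `end_index`, and `structured_data` is a dict holding `filing_date` and `key_points`. Lists are joined into strings on the assumption that the vector store only accepts scalar metadata values.

```python
def build_chunk_records(
    pdf_url: str,
    chunks: list[dict],
    structured_data: dict,
    page_classifications: list[str],
) -> list[dict]:
    """Attach document-level context to every chunk before storing it."""
    records = []
    for index, chunk in enumerate(chunks):
        chunk_id = f"{pdf_url}#chunk-{index}"  # hypothetical ID scheme
        records.append({
            "id": chunk_id,
            "pdf_url": pdf_url,
            "chunk_index": index,
            "text": chunk["text"],
            "start_index": chunk["start_index"],
            "end_index": chunk["end_index"],
            "metadata": {
                "source_type": "tesla_sec_filing",
                "pdf_url": pdf_url,
                "chunk_id": chunk_id,
                "total_chunks": len(chunks),
                "chunk_index": index,
                "filing_date": structured_data.get("filing_date", ""),
                # Flattened to strings so every metadata value is a scalar.
                "key_points": "; ".join(structured_data.get("key_points", [])),
                "page_classifications": ", ".join(page_classifications),
            },
        })
    return records
```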

August 21, 2025 at 2:59 AM

Tensorlake

@tensorlake.ai

Step 1: Ingest and pre-process your documents

With a single API call, you can turn messy PDFs into page-aware, table-preserving, structured context. Simply:
1. Define page classes for different documents
2. Define schemas to extract relevant data
3. Parse

Code snippet displayed in a terminal-style window with green background. The Python code shows a call to doc_ai.parse_and_wait passing arguments: file=filing_url, structured_extraction_options=structured_extraction_options, and page_classifications=page_classifications, with the result assigned to a variable named result.

August 21, 2025 at 2:59 AM

Tensorlake

@tensorlake.ai

“RAG is dead” is lazy.

What’s dead is cosine‑N without a retrieval plan.

We ship advanced RAG...out of the box:
• Classify pages → target sections
• Extract structured fields → filter by form_type, fiscal_period
• Verify data; cite page/bbox

Want to know how? 🧵👇
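
To make the contrast with plain cosine top-N concrete, here is a small in-memory sketch of filter-then-rank retrieval: candidates are first narrowed by extracted fields such as `form_type` and `fiscal_period`, and only the survivors are ranked by similarity. The record shape and `filtered_search` are assumptions for illustration; a real setup would push the filter down into the vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: list[float],
    records: list[dict],
    filters: dict,
    top_k: int = 3,
) -> list[dict]:
    """Keep records whose metadata matches every filter, then rank by cosine."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

With a filter in place, a chunk from the wrong filing type can never outrank a relevant one, no matter how similar its embedding is.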

August 21, 2025 at 2:59 AM

Tensorlake

@tensorlake.ai

Most "unstructured" parses fail when layout gets tricky:
multiple columns, fragmented text blocks, mixed reading order

Tensorlake doesn't.

✅ Authors parsed as one clean chunk
✅ Abstract follows, exactly as it should

Unstructured ≠ unordered

Preserve reading order. Parse with Tensorlake.
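
For intuition about why reading order is hard, here is a simple two-column heuristic, which is not Tensorlake's method: full-width blocks (title, authors) split the page into bands, and inside each band the left column is read before the right. The block keys `x0`, `x1`, and `top`, and the 60% width threshold, are assumptions.

```python
def reading_order(blocks: list[dict], page_width: float) -> list[dict]:
    """Order blocks column by column between full-width blocks.

    Each block is a dict with hypothetical keys x0, x1 (horizontal extent)
    and top (vertical position, smaller means higher on the page).
    """
    ordered: list[dict] = []
    band: list[dict] = []  # column blocks seen since the last full-width block

    def flush_band() -> None:
        middle = page_width / 2
        left = [b for b in band if (b["x0"] + b["x1"]) / 2 < middle]
        right = [b for b in band if (b["x0"] + b["x1"]) / 2 >= middle]
        ordered.extend(left + right)  # each column is already sorted by top
        band.clear()

    for block in sorted(blocks, key=lambda b: b["top"]):
        if block["x1"] - block["x0"] > 0.6 * page_width:
            flush_band()
            ordered.append(block)
        else:
            band.append(block)
    flush_band()
    return ordered
```

A naive top-to-bottom sort would interleave the two columns; this version keeps each column contiguous, though it still breaks on three-column or irregular layouts.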

July 14, 2025 at 4:38 PM

Tensorlake

@tensorlake.ai

🚀 We’re in the top 3 on Product Hunt today just 6.5 hours after launch!

Huge thanks to everyone supporting Tensorlake 🎉

From devs wrangling PDFs to teams automating high-stakes workflows.

If you haven’t yet, check us out 👇

May 16, 2025 at 1:27 PM

Tensorlake

@tensorlake.ai

We just launched Tensorlake Cloud on Product Hunt 🎉

If you’ve dealt with messy document workflows or tried to parse complex documents (insurance claims, financial docs, multi-page forms), this is for you.

Would love your support 💚
www.producthunt.com/products/te...

May 16, 2025 at 7:04 AM
