Mohit (@mohitmayank.com)
The AI Guy | Helping Startups 10x AI Game 🤖 | Author “Lazy Data Science Guide” | Creator of Jaal
Full breakdown in this excellent article here: ngrok.com/blog/prompt...
Prompt caching: 10x cheaper LLM tokens, but how? | ngrok blog
A far more detailed explanation of prompt caching than anyone asked for.
ngrok.com
December 21, 2025 at 6:17 AM
Result? Up to 85% faster inference on long prompts, 10x cheaper costs - all because you're not recomputing what you already computed.

One catch: you need exact prefix matches. Change even one token early on, and the entire cache is invalidated.
December 21, 2025 at 6:17 AM
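To make the exact-prefix rule concrete, here's a toy Python sketch (not any provider's real implementation): the cache is keyed on token prefixes, so flipping one early token means no stored prefix matches and everything gets recomputed.

```python
# Toy prefix cache: maps a tuple of prefix tokens -> precomputed KV state (hypothetical).
cache: dict[tuple[str, ...], str] = {}

def longest_cached_prefix(tokens: list[str]) -> int:
    """Return how many leading tokens can be served from the cache."""
    for end in range(len(tokens), 0, -1):          # try the longest prefix first
        if tuple(tokens[:end]) in cache:
            return end
    return 0

system = ["You", "are", "a", "helpful", "assistant", "."]
cache[tuple(system)] = "kv-state-for-system-prompt"   # stored after the first request

print(longest_cached_prefix(system + ["Summarize", "this", "doc"]))               # 6 -> prefix reused
print(longest_cached_prefix(["You", "are", "our", "helpful", "assistant", "."]))  # 0 -> full recompute
```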
With caching enabled, the model stores these vectors. Next time you send a prompt with the same prefix, it skips the computation entirely and jumps straight to processing your new tokens.
December 21, 2025 at 6:17 AM
The intermediate key-value vectors from the attention mechanism. When your prompt flows through the transformer, each token generates these vectors - and that computation is expensive.
December 21, 2025 at 6:17 AM
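For intuition, a rough PyTorch sketch of what gets stored: the key/value projections of the prefix tokens. Real engines cache these per layer and per head, usually in fixed-size blocks; this is just the shape of the idea.

```python
import torch
import torch.nn as nn

d_model = 64
W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

prefix = torch.randn(1, 100, d_model)        # embeddings of the shared prompt prefix
new_tokens = torch.randn(1, 5, d_model)      # embeddings of the new suffix

# First request: compute K/V for the prefix once and keep them around.
cached_k, cached_v = W_k(prefix), W_v(prefix)

# Later request with the same prefix: only project the new tokens,
# then concatenate with the cached tensors before attention.
k = torch.cat([cached_k, W_k(new_tokens)], dim=1)
v = torch.cat([cached_v, W_v(new_tokens)], dim=1)
print(k.shape, v.shape)                      # torch.Size([1, 105, 64]) twice
```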
You can verify this yourself: send identical prompts multiple times with caching enabled. You'll get different responses each time, even though the API confirms cached tokens were used.

So what's actually being cached?
December 21, 2025 at 6:17 AM
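One way to check this in practice. The sketch below assumes the OpenAI Python SDK and that the response reports reused prefix tokens under usage.prompt_tokens_details.cached_tokens; field names and minimum prefix lengths vary by provider.

```python
from openai import OpenAI

client = OpenAI()
# A long, repeated system prompt so the prefix clears the provider's caching threshold.
long_prefix = "You are a meticulous analyst. Answer concisely. " * 200

for _ in range(2):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": long_prefix},
            {"role": "user", "content": "Give me a one-line summary of prompt caching."},
        ],
    )
    # The second call should report cached prefix tokens, yet the generated text can still differ.
    print(resp.usage.prompt_tokens_details.cached_tokens,
          resp.choices[0].message.content[:60])
```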
The real innovation? This scales from 14 fixed categories to unlimited routing rules. Enterprises can add custom domains via LoRA, define complex logic, and orchestrate plugins - all while maintaining interpretability where it matters.
December 17, 2025 at 3:32 AM
Then combines them with flexible AND/OR logic to make intelligent routing decisions.

Example: "Urgent security vulnerability in auth code" now captures:
• Urgency signal → immediate attention
• Security signal → jailbreak protection
• Code review intent → reasoning capabilities
December 17, 2025 at 3:32 AM
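None of this is the router's actual code, but here's a hedged Python sketch of the AND/OR idea; the signal names, patterns, and rules below are made up for illustration.

```python
import re

def keyword_signals(text: str) -> set[str]:
    """Regex-based, fully interpretable signals (hypothetical patterns)."""
    signals = set()
    if re.search(r"\b(urgent|asap|immediately)\b", text, re.I):
        signals.add("urgency")
    if re.search(r"\b(vulnerability|exploit|auth|jailbreak)\b", text, re.I):
        signals.add("security")
    if re.search(r"\b(code|function|refactor|review)\b", text, re.I):
        signals.add("code_review")
    return signals

# Hypothetical routing rules: (required signals, target policy), checked in order.
RULES = [
    ({"urgency", "security"}, "strong-reasoning-model + jailbreak-guard"),
    ({"code_review"}, "code-specialist-model"),
]

def route(text: str) -> str:
    signals = keyword_signals(text)          # embedding/domain signals would be OR'd in here
    for required, target in RULES:
        if required <= signals:              # AND: every required signal must be present
            return target
    return "default-model"

print(route("Urgent security vulnerability in auth code"))
# -> strong-reasoning-model + jailbreak-guard
```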
The new architecture extracts multiple signals simultaneously:

→ Keyword signals (regex-based, fully interpretable)
→ Embedding signals (semantic understanding at scale)
→ Domain signals (MMLU + custom LoRA adapters)
December 17, 2025 at 3:32 AM
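And a rough sketch of what an embedding signal could look like, using sentence-transformers; the model name, prototypes, and threshold are placeholders, not what the router ships with.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder embedding backbone; the router's real one may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

intent_prototypes = {
    "security": "report a security vulnerability or exploit",
    "code_review": "review or debug source code",
    "casual_chat": "small talk and general questions",
}
proto_emb = model.encode(list(intent_prototypes.values()), normalize_embeddings=True)

def embedding_signals(query: str, threshold: float = 0.4) -> list[str]:
    """Return intent labels whose prototype is semantically close to the query."""
    q = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q, proto_emb)[0]
    return [label for label, s in zip(intent_prototypes, scores) if s >= threshold]

print(embedding_signals("Urgent security vulnerability in auth code"))
```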
vLLM's model routing has improved over time. The old approach? Classify a query into one of 14 MMLU categories, route to a model. Simple, but it misses everything that matters - urgency, security sensitivity, intent complexity, compliance needs.
December 17, 2025 at 3:32 AM
Wrote a complete guide with a code walkthrough using Gemma-3 and the SNAC codec: mohitmayank.com/a_lazy_data...

4/4
December 15, 2025 at 11:33 AM
Why this works:
- Transfer learning from pretrained models
- Compatible with standard LLM architectures
- Scales efficiently for long sequences
- Easy to extend for multi-speaker scenarios

The pipeline is simple: Text → LLM → Audio Tokens → Neural Codec Decoder → Audio

3/4
December 15, 2025 at 11:33 AM
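A hedged sketch of the tail end of that pipeline using the open-source snac package. The LLM stage is stubbed out with random codes; the checkpoint name, codebook size, and three-level code shapes are assumptions about the 24 kHz model, and the real token layout is covered in the guide.

```python
import torch
from snac import SNAC   # pip install snac

# Assumption: the 24 kHz SNAC checkpoint with three hierarchical code levels.
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def llm_generate_audio_tokens(text: str) -> list[torch.Tensor]:
    """Stand-in for the fine-tuned LLM (e.g. Gemma-3): it would emit audio tokens
    that get regrouped into SNAC's hierarchical codes. Random codes with plausible
    shapes are returned here just so the decode step runs."""
    t = 48
    return [torch.randint(0, 4096, (1, t)),        # coarse level
            torch.randint(0, 4096, (1, 2 * t)),    # mid level
            torch.randint(0, 4096, (1, 4 * t))]    # fine level

codes = llm_generate_audio_tokens("Hello from a text-to-speech LLM!")
with torch.inference_mode():
    audio = codec.decode(codes)                    # Neural Codec Decoder -> waveform
print(audio.shape)                                 # (1, 1, n_samples) at 24 kHz
```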
The idea: treat audio generation as a sequence problem. Use neural codecs like SNAC to compress audio into discrete tokens (180k samples → 679 tokens), then fine-tune a pretrained LLM to predict those tokens from text.

2/4
December 15, 2025 at 11:33 AM
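To see the compression in action, a minimal sketch with the snac package; random audio stands in for real speech, and the exact token count depends on the checkpoint and sample rate.

```python
import torch
from snac import SNAC

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

waveform = torch.randn(1, 1, 180_000)      # ~7.5 s of audio at 24 kHz (180k samples)

with torch.inference_mode():
    codes = codec.encode(waveform)         # list of discrete code tensors, one per level

print(f"{waveform.numel()} samples -> {sum(c.numel() for c in codes)} tokens")
```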
FAE (Feature Auto-Encoder) solves this with elegant simplicity - a single-attention encoder compresses high-dim features into compact latent codes, then a double decoder (feature + pixel) reconstructs both semantic meaning and final images. The result?
December 15, 2025 at 6:11 AM
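For intuition only, a heavily simplified PyTorch sketch of that shape of architecture; the layer sizes, the single attention block, and the two decoder heads are my guesses at the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FAELikeAutoEncoder(nn.Module):
    """Compress high-dim encoder features into a compact latent, then reconstruct
    both the features (semantic path) and pixels (image path)."""
    def __init__(self, feat_dim=1024, latent_dim=32, img_dim=3 * 16 * 16):
        super().__init__()
        # "Single-attention" encoder: one self-attention block over feature tokens,
        # followed by a projection down to the low-dimensional latent.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Double decoder: one head reconstructs the features, one renders pixels.
        self.feature_decoder = nn.Linear(latent_dim, feat_dim)
        self.pixel_decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, img_dim)
        )

    def forward(self, feats):                      # feats: (B, tokens, feat_dim)
        h, _ = self.attn(feats, feats, feats)
        z = self.to_latent(h)                      # compact latent codes
        return self.feature_decoder(z), self.pixel_decoder(z), z

model = FAELikeAutoEncoder()
feats = torch.randn(2, 196, 1024)                  # e.g. ViT patch features
rec_feats, rec_pixels, z = model(feats)
print(z.shape, rec_feats.shape, rec_pixels.shape)
```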
Image generation models want low-dimensional spaces for efficiency. Pre-trained visual encoders need high-dimensional features for understanding. This tension usually requires complex workarounds.
December 15, 2025 at 6:11 AM