Mohit (@mohitmayank.com)
The AI Guy | Helping Startups 10x AI Game 🤖 | Author “Lazy Data Science Guide” | Creator of Jaal
Full breakdown in this excellent article here: ngrok.com/blog/prompt...
Prompt caching: 10x cheaper LLM tokens, but how? | ngrok blog
A far more detailed explanation of prompt caching than anyone asked for.
ngrok.com
December 21, 2025 at 6:17 AM
Result? Up to 85% faster inference on long prompts, 10x cheaper costs - all because you're not recomputing what you already computed.

One catch: you need exact prefix matches. Change even one token early on, and the entire cache is invalidated.
December 21, 2025 at 6:17 AM
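To make the exact-prefix rule concrete, here's a toy Python sketch (not any provider's real implementation): the cache is keyed on token prefixes, so flipping one early token means no stored prefix matches and everything gets recomputed.

```python
# Toy prefix cache: maps a tuple of prefix tokens -> precomputed KV state (hypothetical).
cache: dict[tuple[str, ...], str] = {}

def longest_cached_prefix(tokens: list[str]) -> int:
    """Return how many leading tokens can be served from the cache."""
    for end in range(len(tokens), 0, -1):          # try the longest prefix first
        if tuple(tokens[:end]) in cache:
            return end
    return 0

system = ["You", "are", "a", "helpful", "assistant", "."]
cache[tuple(system)] = "kv-state-for-system-prompt"   # stored after the first request

print(longest_cached_prefix(system + ["Summarize", "this", "doc"]))               # 6 -> prefix reused
print(longest_cached_prefix(["You", "are", "our", "helpful", "assistant", "."]))  # 0 -> full recompute
```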
With caching enabled, the model stores these vectors. Next time you send a prompt with the same prefix, it skips the computation entirely and jumps straight to processing your new tokens.
December 21, 2025 at 6:17 AM
The intermediate key-value vectors from the attention mechanism. When your prompt flows through the transformer, each token generates these vectors - and that computation is expensive.
December 21, 2025 at 6:17 AM
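For intuition, a rough PyTorch sketch of what gets stored: the key/value projections of the prefix tokens. Real engines cache these per layer and per head, usually in fixed-size blocks; this is just the shape of the idea.

```python
import torch
import torch.nn as nn

d_model = 64
W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

prefix = torch.randn(1, 100, d_model)        # embeddings of the shared prompt prefix
new_tokens = torch.randn(1, 5, d_model)      # embeddings of the new suffix

# First request: compute K/V for the prefix once and keep them around.
cached_k, cached_v = W_k(prefix), W_v(prefix)

# Later request with the same prefix: only project the new tokens,
# then concatenate with the cached tensors before attention.
k = torch.cat([cached_k, W_k(new_tokens)], dim=1)
v = torch.cat([cached_v, W_v(new_tokens)], dim=1)
print(k.shape, v.shape)                      # torch.Size([1, 105, 64]) twice
```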
You can verify this yourself: send identical prompts multiple times with caching enabled. You'll get different responses each time, even though the API confirms cached tokens were used.

So what's actually being cached?
December 21, 2025 at 6:17 AM
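One way to check this in practice. The sketch below assumes the OpenAI Python SDK and that the response reports reused prefix tokens under usage.prompt_tokens_details.cached_tokens; field names and minimum prefix lengths vary by provider.

```python
from openai import OpenAI

client = OpenAI()
# A long, repeated system prompt so the prefix clears the provider's caching threshold.
long_prefix = "You are a meticulous analyst. Answer concisely. " * 200

for _ in range(2):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": long_prefix},
            {"role": "user", "content": "Give me a one-line summary of prompt caching."},
        ],
    )
    # The second call should report cached prefix tokens, yet the generated text can still differ.
    print(resp.usage.prompt_tokens_details.cached_tokens,
          resp.choices[0].message.content[:60])
```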
The real innovation? This scales from 14 fixed categories to unlimited routing rules. Enterprises can add custom domains via LoRA, define complex logic, and orchestrate plugins - all while maintaining interpretability where it matters.
December 17, 2025 at 3:32 AM
Then combines them with flexible AND/OR logic to make intelligent routing decisions.

Example: "Urgent security vulnerability in auth code" now captures:
• Urgency signal → immediate attention
• Security signal → jailbreak protection
• Code review intent → reasoning capabilities
December 17, 2025 at 3:32 AM
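None of this is the router's actual code, but here's a hedged Python sketch of the AND/OR idea; the signal names, patterns, and rules below are made up for illustration.

```python
import re

def keyword_signals(text: str) -> set[str]:
    """Regex-based, fully interpretable signals (hypothetical patterns)."""
    signals = set()
    if re.search(r"\b(urgent|asap|immediately)\b", text, re.I):
        signals.add("urgency")
    if re.search(r"\b(vulnerability|exploit|auth|jailbreak)\b", text, re.I):
        signals.add("security")
    if re.search(r"\b(code|function|refactor|review)\b", text, re.I):
        signals.add("code_review")
    return signals

# Hypothetical routing rules: (required signals, target policy), checked in order.
RULES = [
    ({"urgency", "security"}, "strong-reasoning-model + jailbreak-guard"),
    ({"code_review"}, "code-specialist-model"),
]

def route(text: str) -> str:
    signals = keyword_signals(text)          # embedding/domain signals would be OR'd in here
    for required, target in RULES:
        if required <= signals:              # AND: every required signal must be present
            return target
    return "default-model"

print(route("Urgent security vulnerability in auth code"))
# -> strong-reasoning-model + jailbreak-guard
```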
The new architecture extracts multiple signals simultaneously:

→ Keyword signals (regex-based, fully interpretable)
→ Embedding signals (semantic understanding at scale)
→ Domain signals (MMLU + custom LoRA adapters)
December 17, 2025 at 3:32 AM
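And a rough sketch of what an embedding signal could look like, using sentence-transformers; the model name, prototypes, and threshold are placeholders, not what the router ships with.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder embedding backbone; the router's real one may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

intent_prototypes = {
    "security": "report a security vulnerability or exploit",
    "code_review": "review or debug source code",
    "casual_chat": "small talk and general questions",
}
proto_emb = model.encode(list(intent_prototypes.values()), normalize_embeddings=True)

def embedding_signals(query: str, threshold: float = 0.4) -> list[str]:
    """Return intent labels whose prototype is semantically close to the query."""
    q = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q, proto_emb)[0]
    return [label for label, s in zip(intent_prototypes, scores) if s >= threshold]

print(embedding_signals("Urgent security vulnerability in auth code"))
```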
vLLM's model routing has improved over time. The old approach? Classify a query into one of 14 MMLU categories, route to a model. Simple, but it misses everything that matters - urgency, security sensitivity, intent complexity, compliance needs.
December 17, 2025 at 3:32 AM
Wrote a complete guide with a code walkthrough using Gemma-3 and the SNAC codec: mohitmayank.com/a_lazy_data...

4/4
December 15, 2025 at 11:33 AM
Why this works:
- Transfer learning from pretrained models
- Compatible with standard LLM architectures
- Scales efficiently for long sequences
- Easy to extend for multi-speaker scenarios

The pipeline is simple: Text → LLM → Audio Tokens → Neural Codec Decoder → Audio

3/4
December 15, 2025 at 11:33 AM
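A hedged sketch of the tail end of that pipeline using the open-source snac package. The LLM stage is stubbed out with random codes; the checkpoint name, codebook size, and three-level code shapes are assumptions about the 24 kHz model, and the real token layout is covered in the guide.

```python
import torch
from snac import SNAC   # pip install snac

# Assumption: the 24 kHz SNAC checkpoint with three hierarchical code levels.
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def llm_generate_audio_tokens(text: str) -> list[torch.Tensor]:
    """Stand-in for the fine-tuned LLM (e.g. Gemma-3): it would emit audio tokens
    that get regrouped into SNAC's hierarchical codes. Random codes with plausible
    shapes are returned here just so the decode step runs."""
    t = 48
    return [torch.randint(0, 4096, (1, t)),        # coarse level
            torch.randint(0, 4096, (1, 2 * t)),    # mid level
            torch.randint(0, 4096, (1, 4 * t))]    # fine level

codes = llm_generate_audio_tokens("Hello from a text-to-speech LLM!")
with torch.inference_mode():
    audio = codec.decode(codes)                    # Neural Codec Decoder -> waveform
print(audio.shape)                                 # (1, 1, n_samples) at 24 kHz
```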
The idea: treat audio generation as a sequence problem. Use neural codecs like SNAC to compress audio into discrete tokens (180k samples → 679 tokens), then fine-tune a pretrained LLM to predict those tokens from text.

2/4
December 15, 2025 at 11:33 AM
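To see the compression in action, a minimal sketch with the snac package; random audio stands in for real speech, and the exact token count depends on the checkpoint and sample rate.

```python
import torch
from snac import SNAC

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

waveform = torch.randn(1, 1, 180_000)      # ~7.5 s of audio at 24 kHz (180k samples)

with torch.inference_mode():
    codes = codec.encode(waveform)         # list of discrete code tensors, one per level

print(f"{waveform.numel()} samples -> {sum(c.numel() for c in codes)} tokens")
```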
FAE (Feature Auto-Encoder) solves this with elegant simplicity - a single-attention encoder compresses high-dim features into compact latent codes, then a double decoder (feature + pixel) reconstructs both semantic meaning and final images. The result?
December 15, 2025 at 6:11 AM
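For intuition only, a heavily simplified PyTorch sketch of that shape of architecture; the layer sizes, the single attention block, and the two decoder heads are my guesses at the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FAELikeAutoEncoder(nn.Module):
    """Compress high-dim encoder features into a compact latent, then reconstruct
    both the features (semantic path) and pixels (image path)."""
    def __init__(self, feat_dim=1024, latent_dim=32, img_dim=3 * 16 * 16):
        super().__init__()
        # "Single-attention" encoder: one self-attention block over feature tokens,
        # followed by a projection down to the low-dimensional latent.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Double decoder: one head reconstructs the features, one renders pixels.
        self.feature_decoder = nn.Linear(latent_dim, feat_dim)
        self.pixel_decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, img_dim)
        )

    def forward(self, feats):                      # feats: (B, tokens, feat_dim)
        h, _ = self.attn(feats, feats, feats)
        z = self.to_latent(h)                      # compact latent codes
        return self.feature_decoder(z), self.pixel_decoder(z), z

model = FAELikeAutoEncoder()
feats = torch.randn(2, 196, 1024)                  # e.g. ViT patch features
rec_feats, rec_pixels, z = model(feats)
print(z.shape, rec_feats.shape, rec_pixels.shape)
```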
Image generation models want low-dimensional spaces for efficiency. Pre-trained visual encoders need high-dimensional features for understanding. This tension usually requires complex workarounds.
December 15, 2025 at 6:11 AM