llm-d
@llm-d.ai
llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale.

Learn more at: https://llm-d.ai
A huge shoutout to the contributors in SIG-benchmarking for making performance transparency a core pillar of the llm-d project!

🚀 Check out the full demo here: youtu.be/TNYXjZpLCN4

#AI #Kubernetes #Benchmarking
Community Demo: Verified & Reproducible LLM Benchmarks | llm-d Project
In the llm-d open-source project, we believe a supported guide is only as good as the data backing it. In this community demo, the SIG-benchmarking team showcases the benchmarking suite that brings…
youtu.be
January 19, 2026 at 8:13 PM
⚫ 100% Reproducibility: We aim for a world where if you see a benchmark in an llm-d blog post, you can run the exact same template on your cluster and see the same results. Transparency is key to scaling AI.
January 19, 2026 at 8:13 PM
Why does this matter for the community?

⚫ Verified, Not Just Documented: Every community-tested guide is now backed by standardized benchmarking templates.

If the guide says it performs, we provide the tools to prove it (a rough sketch of the idea follows below).
January 19, 2026 at 8:13 PM
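To make "prove it" concrete, here is a minimal, hand-rolled probe of an OpenAI-compatible completions endpoint that reports throughput and latency percentiles. This is a sketch only, not the llm-d benchmarking suite or its templates; the endpoint URL, model name, and request counts are placeholder assumptions for whatever stack you have deployed.

```python
# Minimal latency/throughput probe for an OpenAI-compatible endpoint.
# NOT the llm-d benchmarking suite; it only illustrates the kind of numbers
# (requests/sec, latency percentiles) that standardized templates make
# reproducible. ENDPOINT and MODEL are placeholders for your deployment.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed: your stack's URL
MODEL = "your-model-name"                          # assumed: served model id
PROMPT = "Explain KV caching in one sentence."
CONCURRENCY = 8
REQUESTS_TOTAL = 64


def one_request() -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start


def main() -> None:
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(REQUESTS_TOTAL)))
    wall = time.perf_counter() - wall_start

    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"requests/sec: {REQUESTS_TOTAL / wall:.2f}")
    print(f"latency p50: {p50:.3f}s  p95: {p95:.3f}s")


if __name__ == "__main__":
    main()
```

As the thread describes, the point of standardized benchmark templates is to pin down exactly these knobs (workload, concurrency, reported metrics) so that two clusters running the same template produce comparable results.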
This new contribution allows anyone to benchmark a pre-existing or pre-installed stack. It is specifically designed for stacks deployed via official llm-d guides to ensure your setup matches our verified community baselines.
January 19, 2026 at 8:13 PM
In our latest community demo, the SIG-benchmarking team showcases their benchmarking suite that brings verified performance standards directly to your local environment. No more guessing if your stack is optimized.
January 19, 2026 at 8:13 PM
Reposted by llm-d
If you see me around the hallway or at the sessions, I’d love to chat about:
- Model inference (KServe, vLLM, @llm-d.ai)
- @kubernetes.io AI Conformance Program
- @kubefloworg.bsky.social & @argoproj.bsky.social
- @cncf.io TAG Workloads Foundation
- Open source, cloud-native, AI infra and systems
January 15, 2026 at 5:06 PM
Check out our updated guide on leveraging tiered caching in your own cluster: llm-d.ai/docs/guide/I...

Up next: A deep-dive blog post on deployment patterns and scheduling behavior. Stay tuned! ⚡️
Prefix Cache Offloading - CPU | llm-d
Well-lit path for offloading prefix (KV) cache from GPU HBM to CPU memory
llm-d.ai
January 9, 2026 at 6:45 PM
By separating memory transfer mechanisms from global scheduling logic, llm-d ensures you get the best of both: peak engine performance + optimal resource utilization across the entire fleet. 🛠️
January 9, 2026 at 6:45 PM
How we’re using it:

⚫️ Tiered-Prefix-Cache: We use the new connector to bridge GPU HBM and CPU RAM, creating a massive, multi-tier cache hierarchy.

⚫️ Intelligent Scheduling: Our scheduler now routes requests to pods where KV blocks are already warm (in GPU HBM or CPU RAM); a rough sketch of the idea follows after this post.
January 9, 2026 at 6:45 PM
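As an illustration of the scheduling idea described above (and only that; this is not the actual llm-d scheduler or its scorer plugins), the toy sketch below hashes a prompt into fixed-size blocks and routes to the pod holding the longest already-warm prefix, whether those blocks sit in the GPU tier or the CPU tier. Pod names, block size, and the hashing scheme are all made-up placeholders.

```python
# Illustrative sketch only: a toy prefix-cache-aware scorer, not the llm-d
# scheduler. It captures the idea from the post above: split a prompt into
# blocks, count how many leading blocks are already "warm" on each pod
# (GPU HBM or CPU RAM tier), and route to the pod with the longest warm prefix.

import hashlib
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # toy block size; real block sizes are engine-specific


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each prompt block chained with its prefix, so identical prefixes
    produce identical leading hashes."""
    hashes, running = [], hashlib.sha256()
    for i in range(0, len(tokens), BLOCK_TOKENS):
        running.update(bytes(str(tokens[i:i + BLOCK_TOKENS]), "utf-8"))
        hashes.append(running.hexdigest())
    return hashes


@dataclass
class Pod:
    name: str
    gpu_blocks: set[str] = field(default_factory=set)  # warm in GPU HBM
    cpu_blocks: set[str] = field(default_factory=set)  # offloaded to CPU RAM

    def warm_prefix_len(self, hashes: list[str]) -> int:
        """Count leading blocks already cached in either tier."""
        n = 0
        for h in hashes:
            if h in self.gpu_blocks or h in self.cpu_blocks:
                n += 1
            else:
                break
        return n


def pick_pod(pods: list[Pod], prompt_tokens: list[int]) -> Pod:
    """Route to the pod with the longest warm prefix for this prompt."""
    hashes = block_hashes(prompt_tokens)
    return max(pods, key=lambda p: p.warm_prefix_len(hashes))


if __name__ == "__main__":
    shared = block_hashes(list(range(64)))  # pretend these blocks were served before
    pods = [
        Pod("pod-a"),
        Pod("pod-b", gpu_blocks=set(shared[:2]), cpu_blocks=set(shared[2:])),
    ]
    print(pick_pod(pods, list(range(64))).name)  # -> pod-b
```

A production scheduler would combine this prefix-affinity signal with load, queue depth, and SLO signals rather than using it alone.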
Our mission with llm-d is building the control plane that translates these engine-level wins into cluster-wide performance.

We’ve already integrated these capabilities into our core architecture to bridge the gap between raw hardware power and distributed scale.
January 9, 2026 at 6:45 PM