Sam Rose
@samwho.dev
That guy who makes visual essays about software at https://samwho.dev.

Developer Educator @ ngrok.com. Want to pair on something ngrok related? Let's do it! https://cal.com/samwho/workhours

He/him.
Sadly Safari once again almost turned me into the Joker.

- Doesn’t support view-transition-name: match-element; in the shadow DOM.
- Doesn’t support :host-context()

Or it does and I was holding it wrong. Either way I had to work around both, while Chrome and Firefox worked fine.
November 10, 2025 at 10:26 PM
How would “what is a x <injection>” ever get a cache hit from “what is a x” though?
November 10, 2025 at 8:48 PM
“If this key has a prompt injection you might be toast” what specifically do you mean here?
November 10, 2025 at 8:12 PM
I’m not sure I follow. The only way another person would ever use the cache is if they used the exact same prompt, which would mean it would have to already contain the injection.

If you put the weights for one prompt into the cache entry for another, the model would just spit out nonsense.
November 10, 2025 at 7:51 PM
Also the cache value is derived from the key with what is effectively a pure function (assuming the KV weight matrices don’t change, which they don’t). The only way to match against a cache entry is to use the same key. I’m not sure what route would exist to poison it even if you wanted to.
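
A minimal numpy sketch of that pure-function property, with toy sizes and made-up weights (the real projection matrices are learned parameters you never get to see):

    import numpy as np

    d_model, d_head = 8, 4  # toy dimensions; real models are far larger
    rng = np.random.default_rng(0)

    # Fixed projection matrices, standing in for the model's learned weights.
    W_K = rng.standard_normal((d_model, d_head))
    W_V = rng.standard_normal((d_model, d_head))

    def kv_for(token_embeddings):
        """Pure function: same input + same weights -> same K and V matrices."""
        return token_embeddings @ W_K, token_embeddings @ W_V

    # The cache is keyed by the exact input sequence. A different prompt is a
    # different key, so it can never match against someone else's entry.
    kv_cache = {}

    def cached_kv(token_embeddings):
        key = token_embeddings.tobytes()
        if key not in kv_cache:
            kv_cache[key] = kv_for(token_embeddings)
        return kv_cache[key]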
November 10, 2025 at 7:47 PM
How would that work in this context? What would you put in the cache to poison it? Remembering that we’re talking about the KV cache of the attention mechanism, which contains just matrices of floating point numbers to feed into the attention calculation. You don’t know the model weights.
November 10, 2025 at 7:45 PM
I probably will end up with something like that. I’m trying to talk about KV caching.
November 10, 2025 at 6:35 PM
I started out creating a staggered CSS-based animation for every single line to keep it all in sync.

Then I realised it's the exact same effect as just dragging a large rectangular mask over all of the lines.

I'm silly.
November 10, 2025 at 6:14 PM
The cache prefix-matches on all of the information given to the LLM. It’s an almost essential component of making LLMs as fast as they are, especially at long context lengths. Without it you’re looking at an order of magnitude slower inference.
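
To make "prefix-matches" concrete, here's a toy sketch of the lookup, assuming the cache is keyed by token-id prefixes (real inference servers do this in fixed-size blocks, deep inside the serving stack; none of these names are real APIs):

    from typing import Dict, List, Tuple

    prefix_cache: Dict[Tuple[int, ...], object] = {}  # prefix -> its K/V matrices

    def longest_cached_prefix(tokens: List[int]) -> int:
        """Length of the longest prefix we already have K/V matrices for."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in prefix_cache:
                return n
        return 0

    def prefill(tokens: List[int]) -> List[int]:
        """Return only the suffix that still needs the expensive attention pass."""
        hit = longest_cached_prefix(tokens)
        prefix_cache[tuple(tokens)] = "K/V for this sequence"  # placeholder
        return tokens[hit:]

With a long shared system prompt at the front of every request, that suffix is tiny, which is where most of the speedup comes from.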
November 10, 2025 at 2:21 PM
It sort of is like that. Calculating attention is expensive and it’s a pure function of input tokens, so it’s very cacheable. But you can’t access it at all; it’s an internal detail. It also isn’t going to leak anything as far as I can tell; that’s what I’m trying to figure out.
November 10, 2025 at 12:05 PM
The cache values contain no information that cannot be found in the keys. It’s the keys multiplied by weight matrices you don’t have access to.

How do you get a response out of the cache?
November 10, 2025 at 11:55 AM
I bet @textfiles.com has experience with this.
November 10, 2025 at 11:52 AM
What personal or confidential information is contained in the KV cache of an LLM’s attention mechanism?
November 10, 2025 at 11:40 AM
Maybe.
November 10, 2025 at 11:17 AM
What makes you think it probably isn’t bad?
November 10, 2025 at 11:16 AM
And I appreciate that. I’m not looking for a principled explanation though, I understand why it is a Good Thing to not share caches. What I’m interested in is what attacks you specifically open up in this specific scenario by not following that principle. Sorry, I could have been clearer in the OP.
November 10, 2025 at 11:12 AM
I appreciate what you’re saying. My question wasn’t “why is it bad to share caches in general?” but “why is it bad to share the KV cache of an LLM’s multi-head attention mechanism?” I’m really looking for specifics about this very niche type of caching.
November 10, 2025 at 11:09 AM
If you get the wrong values into the cache the model will almost certainly devolve into producing complete nonsense. It won’t achieve anything, because incorrect weights (which are basically what’s getting cached) would just destroy the model’s calculations.

Also temperature has no bearing on the cache.
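
As a toy illustration of why a scrambled cache derails the model rather than steering it, here's scaled dot-product attention with a correct versus a corrupted K matrix (random numbers standing in for real activations):

    import numpy as np

    def attention(Q, K, V):
        """Standard scaled dot-product attention."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))

    good = attention(Q, K, V)                                     # what the model expects
    bad = attention(Q, K + 10 * rng.standard_normal(K.shape), V)  # "poisoned" cache entry

    # Downstream layers were trained against outputs like `good`; handing them
    # `bad` just wrecks the computation instead of nudging it somewhere useful.
    print(np.abs(good - bad).max())

And temperature only enters at the sampling step, after the whole forward pass, so it never touches anything that ends up in the cache.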
November 10, 2025 at 11:04 AM
Though now I’m thinking more about it, if the user feeds the conversation plus response back in maybe you could use a timing attack to figure out the prompt. 🤔
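
A rough sketch of what that timing attack might look like, purely hypothetical: client.stream is a made-up streaming call, and the premise assumes a cache shared across users, which is exactly the scenario being debated here.

    import time

    def time_to_first_token(client, prompt: str) -> float:
        """How long the provider takes to start streaming a response."""
        start = time.perf_counter()
        for _chunk in client.stream(prompt):  # hypothetical streaming API
            break
        return time.perf_counter() - start

    # If a guessed prefix is already sitting in a shared cache, its
    # time-to-first-token should be noticeably lower than a cold prompt of the
    # same length. Whether that signal is measurable in practice is the open
    # question.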
November 10, 2025 at 10:59 AM
I think you’re possibly having the same confusion about what prompt caching is that another reply had. The cache values are not responses.

bsky.app/profile/samw...
I think there’s potentially some misunderstanding around how caching in LLMs works.

What LLMs cache are matrices derived from the input. They’re keyed off subsequences of the prompt. You never see or retrieve these values, and the derivation of their values requires weights you don’t have.
November 10, 2025 at 10:59 AM
Which is why I’m wondering about timing attacks. But it’d be tricky to take advantage of in practice because I believe the providers cache in fixed size blocks. But attackers are smarter than me so I’m sure there’s something you can do.
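
A tiny sketch of what block granularity does to that signal, assuming a hypothetical 128-token block size:

    BLOCK = 128  # hypothetical; real block sizes vary by provider

    def usable_prefix(matched_tokens: int) -> int:
        """A prefix match only helps up to the last full block boundary."""
        return (matched_tokens // BLOCK) * BLOCK

    print(usable_prefix(200))  # 128: a 200-token match behaves like a 128-token one

So the timing signal could only tell you where two prompts diverge to within a block, not to the exact token.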
November 10, 2025 at 10:57 AM
So the cache values are derived from the keys (pure functional mapping between the two, just very computationally heavy to generate them) and the keys are prompts. If the cache were shared you’d need to use exactly the same prompt as another company to get a hit.
November 10, 2025 at 10:57 AM
I think there’s potentially some misunderstanding around how caching in LLMs works.

What LLMs cache are matrices derived from the input. They’re keyed off subsequences of the prompt. You never see or retrieve these values, and the derivation of their values requires weights you don’t have.
November 10, 2025 at 10:57 AM