A different kind of cache
Response caching reuses whole answers. Prompt caching is finer grained. It reuses the internal work a transformer does on a shared prefix of the prompt, even when the rest of the prompt differs.
Why prefixes repeat
Many requests share a long fixed start such as a system instruction, a tool list, or a document. Because each token's key-value (KV) entries depend only on the tokens before it, processing that prefix produces identical KV cache entries every time, so recomputing them is pure waste.
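To make the "identical every time" point concrete, here is a toy sketch that uses a running hash as a stand-in for per-position KV entries. It assumes causal (left-to-right) processing, where the state at a position depends only on the tokens up to that position. The function `toy_kv_states` and the token lists are hypothetical, made up for illustration; a real model stores key and value tensors, not hashes.

```python
import hashlib


def toy_kv_states(tokens):
    """Return one stand-in 'KV entry' per position.

    Each entry depends only on the tokens at or before that position,
    which mimics causal attention. This is an illustration only.
    """
    states = []
    running = hashlib.sha256()
    for tok in tokens:
        running.update(tok.encode("utf-8") + b"\x00")
        # Snapshot the running hash so later tokens do not affect it.
        states.append(running.copy().hexdigest()[:12])
    return states


# Hypothetical shared start and two different continuations.
system = ["You", "are", "a", "helpful", "assistant", "."]
request_a = system + ["What", "is", "a", "KV", "cache", "?"]
request_b = system + ["Summarize", "this", "email", "."]

states_a = toy_kv_states(request_a)
states_b = toy_kv_states(request_b)

n = len(system)
prefix_matches = states_a[:n] == states_b[:n]
suffix_matches = states_a[n:] == states_b[n:]
print(prefix_matches)  # True: prefix entries are the same in both requests
print(suffix_matches)  # False: the differing parts produce different entries
```

The entries for the shared start come out the same in both requests, which is exactly the work that does not need to be repeated.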
How it helps
- The expensive prefix is processed once and its KV cache is stored.
- Later requests with the same prefix skip straight to the new part.
- This cuts time to first token and the compute cost of processing the prompt.
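The steps above can be sketched with a toy cache that counts how many tokens each request has to process. This is a minimal sketch, not how any particular serving system is implemented: `ToyPrefixCache`, its `process` method, and the placeholder state string are all hypothetical, and the stored value stands in for real KV tensors.

```python
class ToyPrefixCache:
    """Maps an exact token prefix to its (pretend) KV state."""

    def __init__(self):
        self._store = {}

    def process(self, prefix, suffix):
        """Return how many tokens had to be computed for this request."""
        key = tuple(prefix)
        if key in self._store:
            # Cache hit: the prefix state is reused, only the new part runs.
            computed = len(suffix)
        else:
            # Cache miss: compute the prefix once and keep its state.
            self._store[key] = f"kv-state-for-{len(prefix)}-tokens"
            computed = len(prefix) + len(suffix)
        return computed


cache = ToyPrefixCache()
prefix = ["<system>"] * 1000  # long shared start, e.g. instructions
first = cache.process(prefix, ["question", "one"])
second = cache.process(prefix, ["another", "question", "here"])
print(first, second)  # 1002 3
```

The first request pays for the whole prompt; the second pays only for its three new tokens. In a real model the new tokens still attend to the cached prefix, so the saving is in the prefix computation itself, not in attention over it.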
The boundaries
- The match must be an exact prefix; a change invalidates everything cached after it, so a change near the start invalidates nearly the whole cache.
- Cached prefix state takes memory, so it is kept for a limited time.
- Putting stable content first and variable content last maximizes reuse.
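The effect of ordering can be illustrated by measuring how many leading tokens two prompts share, since that is the most a prefix cache could reuse. The helper `common_prefix_len` and the token lists below are hypothetical, and the word-level "tokens" are a simplification of real tokenizer output.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical stable instructions and two variable user messages.
instructions = ["Answer", "briefly", "and", "cite", "sources", "."]
user_a = ["alice", ":", "hello"]
user_b = ["bob", ":", "hi", "there"]

# Stable content first, variable content last.
stable_first = common_prefix_len(instructions + user_a, instructions + user_b)

# Variable content first: the prompts diverge at the very first token.
variable_first = common_prefix_len(user_a + instructions, user_b + instructions)

print(stable_first, variable_first)  # 6 0
```

With the instructions in front, all six instruction tokens are reusable. With the user message in front, nothing is, even though both prompts contain the same instructions.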
Key idea
Prompt caching stores the KV state of a shared prompt prefix so repeated prefixes are not recomputed. Keeping stable content at the front and variable content at the back lets many requests reuse the cached prefix and start faster.