Prompt Caching

A different kind of cache

Response caching reuses whole answers. Prompt caching is finer-grained: it reuses the internal work a transformer does on a shared prefix of the prompt, even when the rest of the prompt differs.

Why prefixes repeat

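Many requests share a long fixed start such as a system instruction, a tool list, or a document. Processing that prefix produces key-value (KV) cache entries, the attention keys and values computed for each token. For the same model and the same prefix these entries are identical every time, so recomputing them is pure waste.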
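To make the repetition concrete, here is a minimal sketch that counts how many leading tokens two requests share. The whitespace split is a stand-in for a real tokenizer, and the names `shared_prefix_len`, `SYSTEM`, `req_a`, and `req_b` are made up for this illustration.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tokens two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy whitespace "tokenizer" (an assumption); real systems use the model's tokenizer.
SYSTEM = "You are a support assistant . Answer using the policy document ."
req_a = (SYSTEM + " How do I reset my password ?").split()
req_b = (SYSTEM + " What is the refund window ?").split()

# Number of leading tokens whose prefix work could be shared between the requests.
shared = shared_prefix_len(req_a, req_b)
print(f"{shared} of {len(req_a)} tokens in request A are shared with request B")
```

In this toy case the whole system instruction is shared and only the question differs, which is the pattern prompt caching exploits.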

How it helps

The expensive prefix is processed once and its KV cache is stored.
Later requests with the same prefix skip straight to the new part.
This cuts the time to first token and the compute cost of processing the prompt; generating the output tokens costs the same as before.
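The three steps above can be sketched with a toy cache. This is an illustration under simplifying assumptions, not how any particular serving stack works: the class `ToyPrefixCache` is hypothetical, the caller marks where the prefix ends, and the "KV state" is a placeholder list rather than real attention tensors. The counter `tokens_computed` stands in for the expensive work.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Hypothetical cache mapping an exact prefix to its stored (placeholder) state."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, ...], List[int]] = {}
        self.tokens_computed = 0  # stand-in for the cost of processing tokens

    def _compute_state(self, tokens: List[str]) -> List[int]:
        # Placeholder for the expensive transformer pass over these tokens.
        self.tokens_computed += len(tokens)
        return [len(t) for t in tokens]

    def prefill(self, prefix: List[str], suffix: List[str]) -> List[int]:
        key = tuple(prefix)
        if key not in self._store:
            # First time this prefix is seen: process it once and store the state.
            self._store[key] = self._compute_state(prefix)
        # Only the new part is processed on every request.
        return self._store[key] + self._compute_state(suffix)


cache = ToyPrefixCache()
prefix = ["sys"] * 1000  # long fixed start

cache.prefill(prefix, ["first", "question"])
after_first = cache.tokens_computed

cache.prefill(prefix, ["second"])
after_second = cache.tokens_computed

print(after_first, after_second)
```

The first request pays for all 1002 tokens. The second request pays only for its one new token, so the running total rises from 1002 to 1003.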

The boundaries

The match must be an exact prefix: the cache is reusable only up to the first token that differs, so a change near the start invalidates almost all of it.
Cached prefix state takes memory, so it is kept for a limited time.
Putting stable content first and variable content last maximizes reuse.
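These limits can be seen in a small sketch. It assumes a hypothetical `TtlPrefixStore` that keys entries on a hash of the whole marked prefix, takes the current time as an argument, and refreshes the expiry on every lookup. Real systems choose their own lifetimes and eviction rules, so treat the numbers as placeholders.

```python
import hashlib
from typing import Dict, List


class TtlPrefixStore:
    """Hypothetical store that remembers prefixes for a limited time."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._expiry: Dict[str, float] = {}  # prefix key -> time it expires

    @staticmethod
    def key(prefix_tokens: List[str]) -> str:
        # Any change to any token in the prefix yields a different key.
        joined = "\x1f".join(prefix_tokens)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def lookup(self, prefix_tokens: List[str], now: float) -> bool:
        """Return True on a hit. Either way, (re)store the prefix with a fresh expiry."""
        k = self.key(prefix_tokens)
        expiry = self._expiry.get(k)
        hit = expiry is not None and now < expiry
        self._expiry[k] = now + self.ttl  # assumed policy: refresh on every use
        return hit


store = TtlPrefixStore(ttl_seconds=300)
stable = ["SYSTEM", "TOOLS", "DOCUMENT"]

results = [
    store.lookup(stable, now=0),                     # first sight: miss
    store.lookup(stable, now=100),                   # same prefix, in time: hit
    store.lookup(["DATE:today"] + stable, now=110),  # variable content first: miss
    store.lookup(stable, now=1000),                  # entry expired at 400: miss
]
print(results)
```

The third lookup misses because variable content sits in front of the stable part. Moving it after the stable part would leave the prefix, and so the key, unchanged.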

Key idea

Prompt caching stores the KV state of a shared prompt prefix so repeated prefixes are not recomputed. Keeping stable content at the front and variable content at the back lets many requests reuse the cached prefix and start faster.
