The repeated work problem
A transformer generates text one token at a time. To produce each new token, attention must look at every previous token. Without a cache, the model would recompute the keys and values for the entire sequence at every step, so each step repeats work already done in earlier steps, and the total work over a generation of n tokens grows quadratically in n.
What the cache stores
The KV cache holds the key and value vectors for every token already processed. When the next token arrives, the model computes keys and values only for that one new token and reads the rest from the cache.
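The mechanism can be sketched for a single attention head in NumPy. This is a minimal illustration, not how any particular framework implements it: the names `KVCache`, `decode_step`, and `full_recompute` are hypothetical, the weights are random, and the sizes are arbitrary. The loop checks that the cached path gives the same output as recomputing keys and values for the whole sequence.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of key and value vectors for one attention head (hypothetical helper)."""

    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k, v):
        # vstack copies the whole array; real systems typically preallocate instead.
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def decode_step(x_new, W_q, W_k, W_v, cache):
    """Project only the new token, then attend over everything in the cache."""
    q = x_new @ W_q
    cache.append(x_new @ W_k, x_new @ W_v)
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.values

def full_recompute(X, W_q, W_k, W_v):
    """Reference path: recompute K and V for every token, return the last token's output."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = K @ Q[-1] / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 4, 5  # arbitrary illustrative sizes
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
X = rng.standard_normal((n_tokens, d_model))

cache = KVCache(d_head)
for t in range(n_tokens):
    cached_out = decode_step(X[t], W_q, W_k, W_v, cache)
    reference_out = full_recompute(X[: t + 1], W_q, W_k, W_v)
    assert np.allclose(cached_out, reference_out)
```

Note that `decode_step` still multiplies the query against every cached key. The cache removes the repeated projections, not the read over the history.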
Why it matters for serving
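- It cuts the per-token cost of producing keys and values from reprocessing the whole history to projecting a single new token. Attention still reads the cached entries, but nothing old is recomputed.
- It is why generation stays fast after the first token: the first step must process the entire prompt to fill the cache, while every later step processes only one new token.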
- It lives in GPU memory and grows with sequence length and batch size.
The memory cost
The cache can dominate memory during serving. Its size scales with the number of layers, the context length, and the number of concurrent requests, so long contexts across many requests can need more memory than the model weights themselves. This is why serving systems track and cap KV cache size carefully.
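A rough back-of-the-envelope estimate makes the scaling concrete. The function below is a sketch under stated assumptions: the name `kv_cache_bytes` is hypothetical, the cache is assumed to hold one key and one value vector per token, per layer, per KV head, and the configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values, 7 billion parameters) is illustrative rather than taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper, ignores allocator overhead)."""
    # Factor of 2: one key vector and one value vector per token, layer, and head.
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem
    return per_token * seq_len * batch_size

GIB = 1024 ** 3

# Illustrative configuration, assumed for this example only.
cache_bytes = kv_cache_bytes(
    n_layers=32, n_kv_heads=32, d_head=128, seq_len=4096, batch_size=32
)
weight_bytes = 7_000_000_000 * 2  # assumed 7B parameters stored in 16 bits

print(f"KV cache: {cache_bytes / GIB:.1f} GiB")  # KV cache: 64.0 GiB
print(f"Weights:  {weight_bytes / GIB:.1f} GiB")  # Weights:  13.0 GiB
```

Under these assumptions each token costs 0.5 MiB of cache, so 32 concurrent requests at 4096 tokens each need several times the memory of the weights.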
Key idea
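The KV cache reuses past keys and values so each new token requires projecting only that token instead of reprocessing the whole history. Attention still reads every cached entry, so per-token attention cost grows linearly with context length rather than staying constant. The price is GPU memory that grows with context and concurrency.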