← Lessons

quiz vs the machine

Gold1380

Machine Learning

The KV Cache For Transformers Revisited

Store past attention keys and values so each new token is cheap.

5 min read · core · beat Gold to climb

The repeated work problem

A transformer generates text one token at a time. To produce each new token, attention must look at every previous token. Without help, the model would recompute the attention keys and values for the whole sequence on every step, which grows quadratically.

What the cache stores

The KV cache holds the key and value vectors for every token already processed. When the next token arrives, the model computes keys and values only for that one new token and reads the rest from the cache.

Why it matters for serving

  • It turns per token cost from scanning the whole history to a single new column of work.
  • It is the reason long generations stay fast after the first token.
  • It lives in GPU memory and grows with sequence length and batch size.

The memory cost

The cache can dominate memory during serving. A long context times many concurrent requests can need more memory than the model weights themselves, which is why serving systems track and cap KV cache size carefully.

Key idea

The KV cache reuses past keys and values so each new token costs constant attention work instead of rescanning history. The price is GPU memory that grows with context and concurrency.

Check yourself

Answer to earn rating on the learn ladder.

1. What does the KV cache let the model avoid recomputing each step?

2. What is the main cost of using a KV cache during serving?

3. Why is the first generated token slower than later ones?