The KV Cache in Transformers

Why generating tokens one at a time stores past keys and values to avoid recomputation.

Why a cache exists

A transformer generates text one token at a time. To produce the next token, attention compares the current token against every previous token using keys and values. Without a cache, the model would recompute the keys and values for all earlier tokens at every step, repeating the same work again and again.

What gets stored

The KV cache saves the key and value vectors for tokens already processed. At each new step:

Compute key and value for only the new token.
Append them to the cache.
Run attention using the full cache of past keys and values.

This turns the cost of each step from quadratic in sequence length into roughly linear, because old work is reused.

The memory cost

The cache grows with sequence length, batch size, number of layers, and number of heads. For long contexts it can dominate GPU memory. Its size is often the real limit on how many requests a server can run at once, which motivates tricks like grouped query attention and paged attention.

Key idea

The KV cache trades memory for speed by storing past keys and values so each generated token reuses earlier work instead of recomputing it.

The KV Cache in Transformers

Why a cache exists

What gets stored

The memory cost

Key idea

Check yourself