The KV Cache

The cost of naive generation

A transformer generates text one token at a time. At each step, attention compares the new token against every previous token using keys and values derived from each. Recomputing those keys and values for the whole history at every step would be hugely wasteful.

What the cache stores

The KV cache saves the key and value vectors for every past token. When a new token arrives, the model computes only its own query, key, and value, then attends over the cached keys and values plus the new one.

Step cost drops from growing with history to a constant amount of new work
The cache grows by one entry per layer and head for each generated token
Generation becomes far faster, especially for long outputs

The memory trade off

The cache trades compute for memory. Its size scales with the number of layers, heads, the head dimension, and the sequence length, multiplied by the batch. For long contexts this memory can dominate and limit how many requests a server handles at once.

Techniques to shrink it

Multi query and grouped query attention share keys and values across heads to cut cache size
Quantizing the cache stores it in lower precision
Eviction drops less important past entries

Key idea

The KV cache stores past keys and values so each generation step does constant new work, trading memory for speed.

The cost of naive generation

What the cache stores

The memory trade off

Techniques to shrink it

Key idea

Check yourself