The cost of naive generation
A transformer generates text one token at a time. At each step, attention compares the new token against every previous token using keys and values derived from each. Recomputing those keys and values for the whole history at every step would be hugely wasteful.
What the cache stores
The KV cache saves the key and value vectors for every past token. When a new token arrives, the model computes only its own query, key, and value, then attends over the cached keys and values plus the new one.
- Step cost drops from growing with history to a constant amount of new work
- The cache grows by one entry per layer and head for each generated token
- Generation becomes far faster, especially for long outputs
The memory trade off
The cache trades compute for memory. Its size scales with the number of layers, heads, the head dimension, and the sequence length, multiplied by the batch. For long contexts this memory can dominate and limit how many requests a server handles at once.
Techniques to shrink it
- Multi query and grouped query attention share keys and values across heads to cut cache size
- Quantizing the cache stores it in lower precision
- Eviction drops less important past entries
Key idea
The KV cache stores past keys and values so each generation step does constant new work, trading memory for speed.