← Lessons

quiz vs the machine

Platinum1780

Machine Learning

The KV Cache

Reusing past keys and values so transformer generation does not redo work.

5 min read · advanced · beat Platinum to climb

The cost of naive generation

A transformer generates text one token at a time. At each step, attention compares the new token against every previous token using keys and values derived from each. Recomputing those keys and values for the whole history at every step would be hugely wasteful.

What the cache stores

The KV cache saves the key and value vectors for every past token. When a new token arrives, the model computes only its own query, key, and value, then attends over the cached keys and values plus the new one.

  • Step cost drops from growing with history to a constant amount of new work
  • The cache grows by one entry per layer and head for each generated token
  • Generation becomes far faster, especially for long outputs

The memory trade off

The cache trades compute for memory. Its size scales with the number of layers, heads, the head dimension, and the sequence length, multiplied by the batch. For long contexts this memory can dominate and limit how many requests a server handles at once.

Techniques to shrink it

  • Multi query and grouped query attention share keys and values across heads to cut cache size
  • Quantizing the cache stores it in lower precision
  • Eviction drops less important past entries

Key idea

The KV cache stores past keys and values so each generation step does constant new work, trading memory for speed.

Check yourself

Answer to earn rating on the learn ladder.

1. What does the KV cache store during generation?

2. What trade off does the KV cache make?

3. How do grouped query and multi query attention help the cache?