What the cache holds
During autoregressive generation, each new token attends to all previous tokens. To avoid recomputing their projections at every step, the model stores each past token's key and value vectors in the KV cache. The cache grows with every generated token and, for long sequences or large batches, can dominate memory.
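To make the growth concrete, the cache size can be estimated from the model shape. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name and the example configuration (32 layers, 32 key-value heads, head dimension 128, 16-bit values) are assumptions chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    Every layer stores one key vector and one value vector (hence the
    factor of 2) per key-value head, per token, per sequence in the batch.
    """
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * batch_size * bytes_per_value


# Hypothetical model shape, assumed for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 1024**3, "GiB for one 4096-token sequence")
```

Under these assumptions each token costs 512 KiB, so the total scales linearly with both sequence length and batch size.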
Why it is the bottleneck
Each decoding step produces one token but must read the entire cache, so generation is limited by memory bandwidth rather than raw compute. Shrinking the cache, or reading it more efficiently, directly speeds up serving.
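A rough lower bound on step latency follows from dividing the bytes read per step by the memory bandwidth. The sketch below assumes a hypothetical 2 GiB cache and a hypothetical bandwidth of 1e12 bytes per second; it ignores everything else a step must read (such as model weights) unless passed in through the illustrative `other_bytes` parameter.

```python
def min_step_seconds(cache_bytes, bandwidth_bytes_per_s, other_bytes=0):
    """Lower bound on the time of one decoding step when the step is
    limited by how fast memory can be read, not by arithmetic."""
    return (cache_bytes + other_bytes) / bandwidth_bytes_per_s


# Assumed numbers, for illustration only.
cache_bytes = 2 * 1024**3        # 2 GiB of cached keys and values
bandwidth = 1e12                 # 1 TB/s of memory bandwidth

step = min_step_seconds(cache_bytes, bandwidth)
print(f"at least {step * 1e3:.2f} ms per token")

# Halving the cache halves this bound, which is why shrinking it pays off.
half = min_step_seconds(cache_bytes // 2, bandwidth)
```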
Ways to shrink it
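- Fewer key-value heads, through grouped-query or multi-query attention.
- Lower precision, by storing keys and values in quantized or otherwise compressed form.
- Eviction, which drops tokens unlikely to matter while keeping attention sinks (the first few tokens, which tend to attract attention regardless of content).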
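The first two items can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular model's scheme: `kv_head_for_query` assumes query heads are assigned to key-value heads in contiguous groups, and `quantize_int8` uses simple symmetric absmax scaling per vector. Both function names are hypothetical.

```python
def kv_head_for_query(query_head, num_query_heads, num_kv_heads):
    """Grouped-query attention: several query heads share one KV head.
    With num_kv_heads == 1 this reduces to multi-query attention."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def quantize_int8(values):
    """Store a vector as int8 codes plus one float scale (absmax scaling)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate values from the codes and the scale."""
    return [c * scale for c in codes]


# 32 query heads sharing 8 KV heads: the cache holds 4x fewer heads.
print([kv_head_for_query(q, 32, 8) for q in (0, 3, 4, 31)])

key = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)
```

Going from 16-bit values to 8-bit codes halves storage at the cost of a rounding error of at most half a quantization step per element.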
Paged management
Serving systems treat the cache like virtual memory, splitting it into fixed-size pages so that many requests can share memory with little fragmentation. This lets a server pack in more concurrent users and reuse shared prefixes across requests.
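The bookkeeping behind this idea resembles a page table. The sketch below is a toy allocator written for illustration, with a hypothetical class name and methods; it tracks only which physical pages belong to which request and does not model prefix sharing, which would additionally need reference counts on pages.

```python
class PagedKVAllocator:
    """Toy page-table bookkeeping for a KV cache split into fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size               # tokens per page
        self.free_pages = list(range(num_pages))
        self.block_tables = {}                   # request id -> physical page ids
        self.lengths = {}                        # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one new token; return (page id, offset in page)."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % self.page_size == 0:         # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, request_id):
        """Return all pages of a finished request to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_pages=4, page_size=2)
slots_a = [allocator.append_token("a") for _ in range(3)]
slot_b = allocator.append_token("b")
```

Because a request only ever wastes the unused tail of its last page, memory that would otherwise be reserved for the maximum sequence length stays available to other requests.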
The tradeoffs
Aggressive shrinking risks dropping a token that later proves important, so designs balance memory savings against output quality, often keeping recent tokens and attention sinks at full fidelity.
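One simple policy in this spirit keeps the first few positions (the sinks) plus a sliding window of recent positions and evicts everything in between. The sketch below is an assumed minimal version of such a policy, with hypothetical parameter names; real systems may also score tokens by attention weight before evicting.

```python
def positions_to_keep(seq_len, num_sinks, recent_window):
    """Return the cache positions retained under a sinks-plus-recent policy.

    Positions [0, num_sinks) are always kept, as are the last
    `recent_window` positions; the two ranges never overlap in the result.
    """
    sinks = range(min(num_sinks, seq_len))
    recent_start = max(seq_len - recent_window, len(sinks))
    return list(sinks) + list(range(recent_start, seq_len))


# With 10 cached tokens, 2 sinks and a window of 3, positions 2..6 are evicted.
print(positions_to_keep(10, num_sinks=2, recent_window=3))
```

The retained set has a fixed upper size of `num_sinks + recent_window`, so memory stops growing with sequence length; the cost is that any evicted middle token can no longer be attended to.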
Key idea
The KV cache stores past keys and values so they need not be recomputed, which makes generation bandwidth-bound. Optimizing the cache through fewer heads, lower precision, eviction, and paged sharing shrinks memory and speeds up serving, at the risk of dropping a token that is needed later.