The extreme sharing
Multi-query attention is the most aggressive form of head sharing. The model keeps all of its query heads but uses a single key head and a single value head per layer, shared by every query head.
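To make the sharing concrete, here is a minimal pure-Python sketch of one decoding step. It is illustrative only, not an optimized or reference implementation. The function names and the list-of-lists layout are assumptions chosen for readability, and the query, key and value projections are assumed to have been applied already.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_query_attention(queries, keys, values):
    """Attend with several query heads over ONE shared key/value head.

    queries: one query vector per head (a single decoding step), each of length d
    keys:    cached key vectors, one per past token, each of length d
    values:  cached value vectors, one per past token
    Returns one output vector per query head.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Every head scores against the same cached keys...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # ...and mixes the same cached values; only the query differs per head.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Note that `keys` and `values` carry no head dimension at all. In ordinary multi-head attention each head would have its own copy of both.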
Why it speeds inference
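The KV cache stores keys and values for every key/value head, at every layer, for every token processed so far. Collapsing to a single key head and a single value head shrinks the cache by a factor equal to the number of heads it replaces. Autoregressive decoding is typically bottlenecked by reading that cache from memory at each step rather than by arithmetic, so a smaller cache usually means noticeably higher throughput.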
- Many query heads keep expressive querying.
- One shared key and value head minimizes memory traffic.
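A quick back-of-the-envelope calculation shows the size of the saving. The model shape below (32 layers, 32 heads, head dimension 128, 4096 tokens, 2 bytes per value) is a hypothetical example, not a specific model, and the formula counts only the key and value tensors.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Leading factor of 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape, chosen only for illustration.
num_layers, num_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(num_layers, num_heads, head_dim, seq_len)  # one KV head per query head
mqa = kv_cache_bytes(num_layers, 1, head_dim, seq_len)          # one KV head in total

print(mha / 2**30, "GiB")  # 2.0 GiB
print(mqa / 2**20, "MiB")  # 64.0 MiB
print(mha // mqa)          # 32
```

The ratio equals the number of heads being shared, so the saving grows with head count. The query projections are unchanged, which is why querying stays expressive.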
The cost
With only one key head and one value head, every query head attends over the same key and value projections, leaving the model less room to represent different content per head. Quality can therefore dip more than with grouped-query attention. For many tasks the speed win is worth it.
When to choose it
Multi-query attention shines when decoding speed and memory matter most, such as serving long generations at scale. Grouped-query attention is preferred when you want most of the speed with less quality loss.
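One way to see the choice is as a single knob: the number of key/value heads. The sketch below maps each query head to the key/value head it reads from, assuming the common convention of contiguous, equal-sized groups. The function names are hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    # Assumes query heads are split into contiguous, equal-sized groups.
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def head_mapping(num_query_heads, num_kv_heads):
    return [kv_head_for_query_head(h, num_query_heads, num_kv_heads)
            for h in range(num_query_heads)]

print(head_mapping(8, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]  multi-head: no sharing
print(head_mapping(8, 2))  # [0, 0, 0, 0, 1, 1, 1, 1]  grouped-query: 2 groups
print(head_mapping(8, 1))  # [0, 0, 0, 0, 0, 0, 0, 0]  multi-query: all share one
```

Setting the number of key/value heads to 1 gives multi-query attention, and setting it equal to the number of query heads gives ordinary multi-head attention. Values in between trade cache size against quality.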
Key idea
Multi-query attention shares one key head and one value head among all query heads, giving the smallest KV cache and the fastest decoding at the price of some quality. It is the extreme end of the spectrum that grouped-query attention softens.