Machine Learning

Multi-Query Attention

All query heads share a single key and value head for fast decoding.

The extreme sharing

Multi-query attention is the most aggressive form of head sharing. Each attention layer keeps all of its query heads but uses just one key head and one value head, which every query head shares.
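To make the sharing concrete, here is a minimal sketch of one decoding step in plain Python. The function name `mqa_decode_step` and the list-of-lists layout are assumptions made for illustration, not taken from any library, and a real implementation would use batched tensor operations instead of loops.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mqa_decode_step(queries, keys, values):
    """One multi-query attention step for the current token.

    queries: one query vector per query head (H vectors of size d).
    keys, values: the cache of the single shared key/value head,
                  one vector per past position (T vectors each).
    Returns one output vector per query head.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:  # every head loops over the SAME keys and values
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Two query heads, two cached positions, head size 2 (toy numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
queries = [[0.0, 0.0], [100.0, 0.0]]
for head, out in enumerate(mqa_decode_step(queries, keys, values)):
    print(head, out)
```

The two heads produce different outputs even though they read identical keys and values: the zero query weights both positions equally, while the second query puts almost all of its weight on the first position.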

Why it speeds inference

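The KV cache stores a key and a value vector for every key-value head at every past position. Collapsing to a single key head and a single value head shrinks the cache by a factor equal to the number of heads, compared with standard multi-head attention. During autoregressive decoding, each new token requires reading that cache from memory, and this read is often the bottleneck, so a smaller cache can mean substantially higher throughput.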

  • Keeping many query heads preserves expressive querying.
  • Sharing one key head and one value head minimizes memory traffic.
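The size difference can be estimated with simple arithmetic. The sketch below uses a hypothetical model configuration (32 layers, 32 query heads, head size 128, 4096 cached tokens, 2 bytes per value); these numbers are assumptions chosen for illustration and do not describe any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration, for illustration only.
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 4096

# Multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(n_layers, n_query_heads, head_dim, seq_len)
# Multi-query attention: a single key/value head per layer.
mqa_bytes = kv_cache_bytes(n_layers, 1, head_dim, seq_len)

print(mha_bytes / 2**20, "MiB")   # 2048.0 MiB
print(mqa_bytes / 2**20, "MiB")   # 64.0 MiB
print(mha_bytes // mqa_bytes)     # 32, the number of query heads
```

Under these assumptions the cache for one sequence drops from 2 GiB to 64 MiB, which is also how much less data each decoding step has to read.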

The cost

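With only one key head and one value head, every query head attends over the same keys and reads from the same values, so the heads can differ only in their queries. The model has less room to represent different content per head, and quality can dip more than with grouped-query attention. For many tasks the speed win is worth it.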

When to choose it

Multi-query attention shines when decoding speed and memory matter most, such as when serving long generations at scale. Grouped-query attention is preferred when you want most of the speed with less quality loss.
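One way to see the trade-off is to treat the number of key-value heads as a dial. The sketch below maps each query head to the key-value head it reads from; the helper `kv_head_for_query_head` and the contiguous grouping it uses are illustrative assumptions, though this layout is a common convention.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the key/value head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


n_q_heads = 8
settings = [("multi-head", 8), ("grouped-query", 2), ("multi-query", 1)]
for name, n_kv_heads in settings:
    mapping = [
        kv_head_for_query_head(h, n_q_heads, n_kv_heads)
        for h in range(n_q_heads)
    ]
    print(name, mapping)
# multi-head [0, 1, 2, 3, 4, 5, 6, 7]
# grouped-query [0, 0, 0, 0, 1, 1, 1, 1]
# multi-query [0, 0, 0, 0, 0, 0, 0, 0]
```

Setting the number of key-value heads to 1 gives multi-query attention, setting it equal to the number of query heads gives ordinary multi-head attention, and grouped-query attention covers the values in between.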

Key idea

Multi-query attention shares one key head and one value head among all query heads, giving the smallest KV cache and the fastest decoding at the price of some quality. It is the extreme end of head sharing, which grouped-query attention softens.

Check yourself

1. How many key and value heads does multi-query attention use?

2. What is the main downside of multi-query attention?