Multi Query Attention

The extreme sharing

Multi query attention is the most aggressive form of head sharing. The model keeps all of its query heads but uses just one key head and one value head for the whole layer.
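The sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name, argument names, and weight layout are assumptions made for the example, the output projection is omitted, and a causal mask is included because the lesson concerns autoregressive decoding.

```python
import numpy as np

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Causal multi query attention for one sequence (illustrative sketch).

    x:   (seq_len, d_model) input activations
    w_q: (d_model, n_heads * d_head), one projection per query head
    w_k: (d_model, d_head), the single shared key projection
    w_v: (d_model, d_head), the single shared value projection
    """
    seq_len = x.shape[0]
    d_head = w_k.shape[1]

    # Each query head keeps its own projection: (n_heads, seq_len, d_head).
    q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Keys and values are computed once and reused by every query head.
    k = x @ w_k  # (seq_len, d_head)
    v = x @ w_v  # (seq_len, d_head)

    # Broadcasting k over the head axis is where the sharing happens.
    scores = q @ k.T / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)

    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Softmax over the key axis, shifted by the row maximum for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ v  # (n_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, n_heads * d_head).
    return out.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
```

Note that k and v have no head axis at all. In ordinary multi head attention they would have the same leading n_heads dimension as q.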

Why it speeds inference

The KV cache stores keys and values for every head at every past position. Collapsing to a single key head and a single value head shrinks the cache by a factor equal to the number of query heads. During autoregressive decoding each new token has to read the whole cache from memory, and that read is often the bottleneck, so a smaller cache can raise throughput substantially.

Many query heads keep expressive querying.
One shared key and value head minimizes memory traffic.
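To put rough numbers on the saving, the cache size can be computed directly. The model shape below is hypothetical and chosen only for illustration, and the helper name kv_cache_bytes is an assumption of this sketch. It assumes 2 bytes per stored value (16-bit floats) and a batch of one sequence.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_value

# Hypothetical model shape, not taken from any particular model.
n_layers, n_query_heads, d_head, seq_len = 32, 32, 128, 4096

# Multi head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(n_layers, n_query_heads, d_head, seq_len)
# Multi query attention: one key/value head for the whole layer.
mqa_bytes = kv_cache_bytes(n_layers, 1, d_head, seq_len)

print(mha_bytes // 2**20, "MiB vs", mqa_bytes // 2**20, "MiB")
# prints: 2048 MiB vs 64 MiB
```

The ratio between the two equals the number of query heads, 32 in this example, whatever the other dimensions are.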

The cost

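With only one key head and one value head, every query head attends over the same keys and reads from the same values, so heads can differ only in their queries. The model has less room to represent different content per head, and quality can dip more than with grouped query attention. For many serving workloads the speed gain is worth that trade.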

When to choose it

Multi query attention shines when decoding speed and memory matter most, such as serving long generations at scale. Grouped query attention is preferred when you want most of the speed with less quality loss.
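One way to see how the two options relate is to treat the number of key/value heads as a single setting. The sketch below assumes a common convention in which consecutive query heads share a key/value head. The function name kv_head_for is hypothetical, and real implementations may lay out the groups differently.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the key/value head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    # Number of query heads that share each key/value head.
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

n_query_heads = 8
for n_kv_heads in (1, 2, 8):
    mapping = [kv_head_for(q, n_query_heads, n_kv_heads)
               for q in range(n_query_heads)]
    print(n_kv_heads, mapping)
# prints:
# 1 [0, 0, 0, 0, 0, 0, 0, 0]
# 2 [0, 0, 0, 0, 1, 1, 1, 1]
# 8 [0, 1, 2, 3, 4, 5, 6, 7]
```

With one key/value head the mapping is multi query attention, with as many key/value heads as query heads it is ordinary multi head attention, and anything in between is grouped query attention.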

Key idea

Multi query attention shares one key head and one value head among all query heads. That gives the smallest KV cache and the fastest decoding, at the price of some quality. It is the extreme end of head sharing, and grouped query attention is the softer middle ground.
