Multi-Head Attention Revisited

Why one attention pattern is never enough.

Many attentions in parallel

A single attention head computes one set of attention weights per query, so it can capture only one kind of relationship at a time. Multi-head attention runs several attention operations in parallel, each with its own query, key, and value projections, then concatenates their outputs.

How it splits the work

  • The model dimension is split across the heads, typically evenly, so with h heads each head works in d_model / h dimensions.
  • Each head computes its own scaled dot-product attention independently.
  • The head outputs are concatenated back to the full model dimension and passed through a final output projection.
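The three steps above can be sketched in plain Python. This is a minimal illustration and not a reference implementation: it uses only the standard library, handles one unbatched sequence, omits masking and biases, and uses random matrices as stand-ins for learned weights. All function names (`matmul`, `softmax`, `attention`, `split_heads`, `multi_head_attention`) are hypothetical. It applies one full-width projection and then slices the columns per head, which is equivalent to giving each head its own smaller projection.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v)


def split_heads(x, num_heads):
    """Slice the feature columns of x into num_heads equal chunks."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(num_heads)]


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Project, attend per head, concatenate, then apply the output projection."""
    assert len(x[0]) % num_heads == 0, "model dimension must divide evenly"
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = [
        attention(q_h, k_h, v_h)
        for q_h, k_h, v_h in zip(
            split_heads(q, num_heads), split_heads(k, num_heads), split_heads(v, num_heads)
        )
    ]
    # Concatenate the head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)


def rand_matrix(rows, cols):
    """Random stand-in for a learned weight matrix (assumption for the demo)."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rand_matrix(seq_len, d_model)
w_q, w_k, w_v, w_o = (rand_matrix(d_model, d_model) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(len(out), len(out[0]))  # 4 8
```

The output has the same shape as the input, one d_model-sized vector per position, which is what lets the block be stacked. A practical implementation would use an array library and batched tensors instead of nested lists.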

Why multiple heads help

Different heads can specialize. One head might track syntactic agreement, another might link a word to a distant antecedent, and another might attend to the previous token. Because each head has its own projections and its own attention weights, the block can model several relationship types at once without forcing them to share a single attention pattern.

The cost balance

Because each head works in only a fraction of the model dimension, the total compute across all heads stays close to that of a single full-width attention, so adding heads costs little extra. The concatenation and output projection then mix the heads back into a single representation.
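A rough multiply-add count makes the cost argument concrete. The sketch below is a back-of-the-envelope estimate under stated assumptions: the model dimension is split evenly, only matrix multiplications are counted, and the function names and example sizes are hypothetical. It also counts attention-weight entries, which do grow with the number of heads and are one reason the heads are nearly free but not entirely.

```python
def attention_multiply_adds(seq_len, d_model, num_heads):
    """Approximate multiply-adds in the matrix products of one attention block."""
    d_head = d_model // num_heads
    projections = 4 * seq_len * d_model * d_model  # query, key, value, output
    scores = num_heads * seq_len * seq_len * d_head  # Q times K-transpose per head
    weighted_sum = num_heads * seq_len * seq_len * d_head  # weights times V per head
    return projections + scores + weighted_sum


def attention_weight_entries(seq_len, num_heads):
    """Number of attention weights produced: one seq_len x seq_len map per head."""
    return num_heads * seq_len * seq_len


seq_len, d_model = 128, 512  # example sizes, chosen only for illustration
one_head = attention_multiply_adds(seq_len, d_model, num_heads=1)
eight_heads = attention_multiply_adds(seq_len, d_model, num_heads=8)
print(one_head == eight_heads)  # True
print(attention_weight_entries(seq_len, 1), attention_weight_entries(seq_len, 8))
```

Since num_heads times d_head equals d_model, the head count cancels out of the matrix-product total. What does scale with the head count is the number of attention maps, and with it the softmax work and the memory needed to hold the weights.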

Key idea

Multi-head attention runs several lower-dimensional attentions in parallel so the model captures many relationship types at once, then concatenates their outputs and projects them back into a single representation.

Check yourself

1. Why does multi-head attention use several heads instead of one?

2. What happens to the head outputs after each head attends?

3. How is total compute kept manageable across many heads?