Machine Learning

Multi-Head Attention

Running several attention patterns in parallel to capture richer relations.

Why one head is not enough

A single attention head produces one set of attention weights for each position, so it can emphasize only one kind of relationship at a time. Language involves many relationships at once, such as syntax, coreference, and topic.
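To make "one set of weights" concrete, here is a minimal sketch of a single head using scaled dot-product attention in NumPy. The function name `single_head_attention`, the random inputs, and the sizes are illustrative assumptions, not part of the lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head (hypothetical helper).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns (output, weights), where weights is (seq_len, seq_len).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # one distribution per query position
    return weights @ v, weights

# Illustrative sizes and random data (assumptions for the demo).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = single_head_attention(x, w_q, w_k, w_v)
print(weights.shape)  # (4, 4)
```

Each row of `weights` is a single distribution over the sequence, so every position gets exactly one way of looking at the others.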

The parallel trick

Multi-head attention runs several attention computations in parallel, each with its own learned query, key, and value projections.

  • Each head looks at the sequence through a different lens
  • One head might track subject and verb agreement
  • Another might link a word to nearby modifiers
  • Each head works on a smaller slice of the embedding (typically d_model / h dimensions for h heads), so the total cost stays similar to that of one full-width head
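The cost claim in the last bullet can be checked with quick arithmetic. This is a rough sketch that counts only the query, key, and value projection weights and ignores biases. The values `d_model = 512` and `num_heads = 8` are illustrative assumptions.

```python
d_model = 512    # embedding width (illustrative)
num_heads = 8    # number of heads (illustrative)
assert d_model % num_heads == 0
d_head = d_model // num_heads  # width each head works in

# One full-width head: three (d_model x d_model) projections.
single_head_params = 3 * d_model * d_model

# Several narrow heads: each has three (d_model x d_head) projections.
multi_head_params = num_heads * 3 * d_model * d_head

print(d_head)                                   # 64
print(single_head_params == multi_head_params)  # True
```

Because `num_heads * d_head` equals `d_model`, adding heads under this scheme redistributes the projection weights rather than multiplying them.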

After each head produces its output, the results are concatenated and passed through a final linear layer that mixes them back together.
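The full flow (project, split into heads, attend, concatenate, mix) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: no batching, masking, or biases, and the per-head projections are stacked into single `(d_model, d_model)` matrices, which is equivalent to giving each head its own smaller matrix. The names `multi_head_attention`, `split_heads`, and `w_o` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Every head computes its own (seq_len, seq_len) weight matrix.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads along the feature axis, then mix with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Illustrative sizes and random weights (assumptions for the demo).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The output keeps the input shape, while `weights` holds one attention pattern per head.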

The payoff

Because each head has its own projections, the heads can specialize. The combined output can represent several relationships at once, which no single head can, and the heads are computed as batched matrix operations that run efficiently on parallel hardware such as GPUs.

Key idea

Multi-head attention runs several attention heads in parallel so the model can capture many relationship types at once, then concatenates their outputs and mixes them with a final linear layer.

Check yourself

1. Why use multiple attention heads instead of one?

2. What happens to the outputs of the heads?