Multi-Head Attention

Why one head is not enough

A single attention head produces one set of attention weights for each position, so it can emphasize only one pattern of relationships at a time. Language involves several relationships at once, such as syntax, coreference, and topic.

The parallel trick

Multi-head attention runs several attention computations in parallel, each with its own learned query, key, and value projections.

Each head looks at the sequence through a different lens.
One head might track subject-verb agreement.
Another might link a word to nearby modifiers.
The model typically splits the embedding dimension evenly across the heads, so each head works in a smaller subspace and the total cost stays close to that of one full-width head.
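
The cost claim in the last point can be checked with simple arithmetic. The sketch below is an illustration only, and it assumes the embedding width divides evenly across the heads and that biases are ignored. The function names and the sizes 512, 128, and 8 are hypothetical values chosen for the example.

```python
def projection_params(d_model: int, num_heads: int) -> int:
    """Weights in the query, key, and value projections (biases ignored)."""
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    per_head = 3 * d_model * d_head  # one Q, one K, and one V matrix per head
    return num_heads * per_head


def score_multiplies(seq_len: int, d_model: int, num_heads: int) -> int:
    """Multiplications needed to form every query-key dot product."""
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    return num_heads * seq_len * seq_len * d_head


# Compare one full-width head with eight narrower heads.
for h in (1, 8):
    print(h, projection_params(512, h), score_multiplies(128, 512, h))
```

Both rows print the same counts, because the factor of `num_heads` cancels against the smaller per-head width. The cost is similar rather than identical: each head still computes its own softmax over a full sequence-by-sequence weight matrix.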

After each head produces its output, the results are concatenated and passed through a final linear layer that mixes them back together.
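
The project, split, attend, concatenate, and mix steps described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation. It assumes scaled dot-product attention inside each head, no masking, and no bias terms, none of which are specified in the text. The names `multi_head_attention`, `split_heads`, `w_q`, `w_k`, `w_v`, and `w_o` are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Slicing one wide projection into column blocks is equivalent to
    # giving every head its own smaller projection matrix.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # One (seq_len, seq_len) attention-weight matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads, then mix them with the final linear layer.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

The output has the same shape as the input, while `weights` holds a separate attention pattern for each head, which is where the different lenses show up.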

The payoff

Because each head has its own projections, the heads can specialize. The combined output can capture more kinds of relationships than any single head, and since the heads do not depend on one another, the whole thing still runs efficiently on hardware that likes parallel work.

Key idea

Multi-head attention runs several attention heads in parallel so the model can capture many relationship types at once, then combines them.
