The mechanism
Scaled dot product attention is the heart of the transformer. Each query vector is compared to every key vector by a dot product, giving a similarity score. Those scores are scaled, passed through softmax to become weights, and used to average the value vectors.
The four steps
- Compute scores as queries times keys transposed.
- Divide each score by the square root of the key dimension.
- Apply softmax along each row so weights sum to one.
- Multiply the weights by the values to get the output.
Why the scaling
When the key dimension is large, dot products grow large in magnitude. Feeding huge numbers into softmax pushes it into a sharp regime where gradients nearly vanish. Dividing by the square root of the dimension keeps the scores in a sane range so learning stays smooth.
Reading the output
Each output vector is a convex combination of value vectors, weighted by how relevant each key was to that query. A token effectively gathers information from the tokens it finds most similar.
Key idea
Attention scores queries against keys, scales by the square root of the dimension, softmaxes into weights, and averages the values, so each token pulls in a relevance weighted blend of the others.