Machine Learning

Scaled Dot Product Attention

The core operation that turns similarity scores into a weighted blend.

The mechanism

Scaled dot product attention is the core operation of the transformer. Each query vector is compared with every key vector by a dot product, giving one similarity score per query-key pair. Those scores are scaled, passed through softmax to become weights, and used to form a weighted average of the value vectors.
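
The whole mechanism is commonly written as a single formula. The symbols here are assumed notation rather than names used in this lesson: Q, K and V are the matrices whose rows are the query, key and value vectors, and d_k is the key dimension.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The softmax is applied to each row separately, so every query gets its own set of weights over the keys.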

The four steps

  • Compute scores as queries times keys transposed.
  • Divide each score by the square root of the key dimension.
  • Apply softmax along each row so each query's weights are non-negative and sum to one.
  • Multiply the weights by the values to get the output.
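
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names, the matrix shapes and the random inputs are all assumptions chosen for the example, and it leaves out batching, masking and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T                     # step 1: queries times keys transposed
    scaled = scores / np.sqrt(d_k)       # step 2: divide by sqrt of key dimension
    weights = softmax(scaled, axis=-1)   # step 3: each row sums to one
    output = weights @ V                 # step 4: weighted average of the values
    return output, weights

# Hypothetical sizes: 3 queries, 5 keys/values, key dimension 4, value dimension 2.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # (3, 2): one output vector per query
print(weights.sum(axis=-1))   # each entry is 1 up to rounding
```

The weight matrix has one row per query and one column per key, which is why the softmax runs along the last axis.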

Why the scaling

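When the key dimension is large, dot products grow large in magnitude: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance equal to the key dimension. Feeding such large scores into softmax saturates it, so nearly all the weight lands on one key and the gradients nearly vanish. Dividing by the square root of the dimension brings the scores back to roughly unit variance, so learning stays smooth.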
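
A small experiment can make the effect of scaling visible. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is a simplification and not a claim about any trained model. The sample sizes and dimensions are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    # Softmax over a 1-D vector, shifted by the max for numerical stability.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_pairs = 10_000

# Part 1: the spread of raw dot products grows with the key dimension,
# while the scaled dot products keep a spread close to one.
stds = {}
for d_k in (4, 64, 1024):
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    raw = np.sum(q * k, axis=1)        # one dot product per (q, k) pair
    scaled = raw / np.sqrt(d_k)
    stds[d_k] = (raw.std(), scaled.std())
    print(d_k, round(raw.std(), 2), round(scaled.std(), 2))

# Part 2: one query against 8 keys at a large dimension.
d_k = 1024
q = rng.standard_normal(d_k)
keys = rng.standard_normal((8, d_k))
raw_scores = keys @ q

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(raw_scores / np.sqrt(d_k))
print(unscaled_weights.max(), scaled_weights.max())
```

Under these assumptions the raw standard deviation tracks the square root of the dimension, and the unscaled softmax puts more of its weight on a single key than the scaled one does.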

Reading the output

Each output vector is a convex combination of value vectors, weighted by how relevant each key was to that query. A token effectively gathers information from the tokens it finds most similar.
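
To see what "convex combination" means in practice, here is a tiny hand-made example. The weights and values are hypothetical numbers picked so the arithmetic is easy to follow. In real attention the weights would come out of the softmax.

```python
import numpy as np

# Hypothetical attention weights: 2 queries over 3 keys, each row sums to one.
weights = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.1, 0.8]])

# Hypothetical value vectors, one row per key.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [4.0, 4.0]])

output = weights @ values
print(output)   # approximately [[1.1, 0.6], [3.3, 3.3]]

# Because the weights are non-negative and sum to one, every output
# coordinate lies between the smallest and largest value in that column.
lower = values.min(axis=0)
upper = values.max(axis=0)
print(np.all(output >= lower) and np.all(output <= upper))
```

The first query leans on the first value vector, so its output stays close to it. The second query puts most of its weight on the third value vector and lands near [4, 4].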

Key idea

Attention scores queries against keys, scales by the square root of the key dimension, softmaxes into weights, and takes a weighted average of the values, so each token pulls in a relevance-weighted blend of the others.

Check yourself

1. Why are attention scores divided by the square root of the key dimension?

2. What does the final step of scaled dot product attention produce?

3. What turns the raw scores into weights that sum to one?