The scoring formula
Attention scores its options by taking the dot product between a query and each key. A large dot product means the query and key point in a similar direction, so that token gets a high weight after softmax.
The scaling factor
Before softmax, each score is divided by the square root of the key dimension. Why? When vectors are long, dot products grow with dimension and their variance increases. Without scaling, some scores become very large.
What large scores do to softmax
Softmax exaggerates differences. If one score dwarfs the rest, softmax outputs a nearly one hot distribution, attention collapses onto a single token, and the gradient flowing back becomes tiny. That makes training slow and unstable.
- Divide by the square root to keep score variance near one.
- Softmax then produces a smoother, trainable distribution.
The full operation
So scaled dot product attention is: score with dot products, divide by the square root of the dimension, optionally add a mask, apply softmax, and weight the values.
Key idea
Scaling dot product scores by the square root of the key dimension keeps their variance controlled so softmax stays smooth and gradients flow, preventing attention from collapsing onto one token early in training.