The Scaled Dot Product

The scoring formula

Attention scores its options by taking the dot product between a query and each key. A large dot product means the query and key point in a similar direction, so that token gets a high weight after softmax.
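As a small illustration of the scoring step, the sketch below scores one query against three keys and turns the scores into weights. The vectors, the variable names, and the softmax helper are made up for this example and are not taken from any particular library.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

# Hypothetical toy vectors chosen for illustration.
query = np.array([1.0, 0.0, 1.0])
keys = np.array([
    [1.0, 0.0, 1.0],    # points the same way as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0],  # points the opposite way
])

scores = keys @ query      # one dot product per key
weights = softmax(scores)  # the aligned key receives the largest weight
```

The key aligned with the query gets the highest score and therefore the largest weight, while the opposing key gets the smallest.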

The scaling factor

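Before softmax, each score is divided by the square root of the key dimension. Why? A dot product sums one term per dimension, so its spread grows with the number of components: if the components of the query and key are independent with zero mean and unit variance, the dot product has variance equal to the key dimension. Without scaling, some scores become very large in magnitude.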
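The effect of dimension on score variance can be checked empirically. The following is a rough sketch that assumes query and key components are independent standard normal draws, which is an idealization; the function name, sample count, and chosen dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_variance(d_k, n_samples=5000):
    # Assumption: components are i.i.d. with zero mean and unit variance.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)   # unscaled dot products
    scaled = raw / np.sqrt(d_k)   # scaled as in attention
    return raw.var(), scaled.var()

# Maps each dimension to (variance of raw scores, variance of scaled scores).
results = {d_k: score_variance(d_k) for d_k in (4, 64, 512)}
```

Under this assumption the raw variance tracks the dimension, while the scaled variance stays close to one regardless of dimension.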

What large scores do to softmax

Softmax exaggerates differences. If one score dwarfs the rest, softmax outputs a nearly one-hot distribution, attention collapses onto a single token, and the gradient flowing back through the softmax becomes tiny. That makes training slow and unstable.

Divide by the square root of the key dimension to keep score variance near one.
Softmax then produces a smoother, trainable distribution.
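To make the saturation concrete, the sketch below compares softmax on a set of unscaled scores with the same scores divided by the square root of an assumed key dimension of 64. The score values are invented for illustration, and the Jacobian helper uses the standard softmax derivative, diag(p) minus the outer product of p with itself.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def softmax_jacobian(p):
    # Derivative of the softmax outputs with respect to its input scores.
    return np.diag(p) - np.outer(p, p)

d_k = 64                                       # assumed key dimension
raw_scores = np.array([16.0, 8.0, 0.0, -8.0])  # hypothetical unscaled scores
scaled_scores = raw_scores / np.sqrt(d_k)      # becomes [2, 1, 0, -1]

p_raw = softmax(raw_scores)        # almost all mass on the first entry
p_scaled = softmax(scaled_scores)  # mass spread over several entries

# Largest absolute Jacobian entry: a rough measure of how much gradient
# can pass back through the softmax.
grad_raw = np.abs(softmax_jacobian(p_raw)).max()
grad_scaled = np.abs(softmax_jacobian(p_scaled)).max()
```

With these numbers the unscaled distribution is nearly one-hot and its largest Jacobian entry is orders of magnitude smaller than in the scaled case.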

The full operation

So scaled dot product attention is: score with dot products, divide by the square root of the key dimension, optionally add a mask, apply softmax, and use the resulting weights to take a weighted sum of the values.
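The steps above can be sketched in a few lines of NumPy. This is a minimal single-head version under stated assumptions: the function name and argument names are illustrative, the mask is additive (0 where attention is allowed, a large negative number where it is blocked), and the causal mask in the usage lines is just one example of such a mask.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    # Dot-product scores, scaled by the square root of the key dimension.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # additive mask: large negative blocks a position
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values.
    return weights @ V, weights

# Usage with random toy inputs: 3 tokens, key dimension 8, value dimension 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((3, 8))
V = rng.standard_normal((3, 4))
causal_mask = np.triu(np.full((3, 3), -1e9), k=1)  # block attention to later tokens
output, weights = scaled_dot_product_attention(Q, K, V, causal_mask)
```

Each row of the returned weights sums to one, and with the causal mask every token places zero weight on the tokens that follow it.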

Key idea

Scaling dot product scores by the square root of the key dimension keeps their variance controlled so softmax stays smooth and gradients flow, preventing attention from collapsing onto one token early in training.
