← Lessons

quiz vs the machine

Silver1100

Machine Learning

The Scaled Dot Product

Why attention scores are divided by a square root before softmax.

4 min read · intro · beat Silver to climb

The scoring formula

Attention scores its options by taking the dot product between a query and each key. A large dot product means the query and key point in a similar direction, so that token gets a high weight after softmax.

The scaling factor

Before softmax, each score is divided by the square root of the key dimension. Why? When vectors are long, dot products grow with dimension and their variance increases. Without scaling, some scores become very large.

What large scores do to softmax

Softmax exaggerates differences. If one score dwarfs the rest, softmax outputs a nearly one hot distribution, attention collapses onto a single token, and the gradient flowing back becomes tiny. That makes training slow and unstable.

  • Divide by the square root to keep score variance near one.
  • Softmax then produces a smoother, trainable distribution.

The full operation

So scaled dot product attention is: score with dot products, divide by the square root of the dimension, optionally add a mask, apply softmax, and weight the values.

Key idea

Scaling dot product scores by the square root of the key dimension keeps their variance controlled so softmax stays smooth and gradients flow, preventing attention from collapsing onto one token early in training.

Check yourself

Answer to earn rating on the learn ladder.

1. Why divide attention scores by the square root of the dimension?

2. What happens if scores are too large before softmax?