Sharpness of the distribution
The softmax that converts attention scores to weights has an implicit temperature. Dividing scores by a larger number before softmax flattens the weights, while a smaller divisor sharpens them toward the top score.
Where temperature already lives
The scaling by the square root of the key dimension is exactly a temperature choice. It is tuned so that at initialization the attention distribution is neither uniform nor a near one hot spike.
Effects of getting it wrong
- Too low a temperature gives spiky, near one hot attention, which can starve gradients and ignore useful context.
- Too high a temperature gives near uniform attention, which blurs everything and loses focus.
- The right range keeps a meaningful gradient signal across many keys.
A tuning lever
Some architectures add a learnable temperature per head so the model can decide how focused each head should be. This lets specialized heads sharpen onto a single token while broad heads pool over many.
Key idea
Attention softmax has a temperature, set by the scaling factor, that controls how sharp the weights are, and keeping it in a balanced range avoids both spiky attention that starves gradients and blurry attention that loses focus.