The Softmax Temperature In Attention

Sharpness of the distribution

The softmax that converts attention scores to weights has an implicit temperature. Dividing scores by a larger number before softmax flattens the weights, while a smaller divisor sharpens them toward the top score.

Where temperature already lives

The scaling by the square root of the key dimension is exactly a temperature choice. It is tuned so that at initialization the attention distribution is neither uniform nor a near one hot spike.

Effects of getting it wrong

Too low a temperature gives spiky, near one hot attention, which can starve gradients and ignore useful context.
Too high a temperature gives near uniform attention, which blurs everything and loses focus.
The right range keeps a meaningful gradient signal across many keys.

A tuning lever

Some architectures add a learnable temperature per head so the model can decide how focused each head should be. This lets specialized heads sharpen onto a single token while broad heads pool over many.

Key idea

Attention softmax has a temperature, set by the scaling factor, that controls how sharp the weights are, and keeping it in a balanced range avoids both spiky attention that starves gradients and blurry attention that loses focus.

The Softmax Temperature In Attention

Sharpness of the distribution

Where temperature already lives

Effects of getting it wrong

A tuning lever

Key idea

Check yourself