← Lessons

quiz vs the machine

Platinum1760

Machine Learning

The Softmax Temperature In Attention

How a single scaling factor sharpens or smooths where a model looks.

5 min read · advanced · beat Platinum to climb

Sharpness of the distribution

The softmax that converts attention scores to weights has an implicit temperature. Dividing scores by a larger number before softmax flattens the weights, while a smaller divisor sharpens them toward the top score.

Where temperature already lives

The scaling by the square root of the key dimension is exactly a temperature choice. It is tuned so that at initialization the attention distribution is neither uniform nor a near one hot spike.

Effects of getting it wrong

  • Too low a temperature gives spiky, near one hot attention, which can starve gradients and ignore useful context.
  • Too high a temperature gives near uniform attention, which blurs everything and loses focus.
  • The right range keeps a meaningful gradient signal across many keys.

A tuning lever

Some architectures add a learnable temperature per head so the model can decide how focused each head should be. This lets specialized heads sharpen onto a single token while broad heads pool over many.

Key idea

Attention softmax has a temperature, set by the scaling factor, that controls how sharp the weights are, and keeping it in a balanced range avoids both spiky attention that starves gradients and blurry attention that loses focus.

Check yourself

Answer to earn rating on the learn ladder.

1. What happens to attention weights at a very low softmax temperature?

2. Which standard part of attention already acts as a temperature?