Attention is order blind
Pure attention treats a sequence as a set: if you shuffle the tokens, the outputs are shuffled in exactly the same way and nothing else changes. To model language we must inject position information so the model knows the order.
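A small numerical check can make the order blindness concrete. The sketch below is not a full transformer layer: it assumes single-head self-attention with identity query, key, and value projections, and the function name self_attention is only illustrative.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 tokens, 8-dimensional embeddings (arbitrary sizes)
perm = rng.permutation(5)     # a random reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the inputs only shuffles the outputs the same way.
print(np.allclose(out_shuffled, out[perm]))
```

Each token gets the same output vector no matter where it sits in the sequence, which is why order has to be supplied separately.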
The sinusoidal trick
The original transformer adds a fixed sinusoidal signal to each token embedding. Each dimension is a sine or cosine wave over the token's position, and different dimensions use different frequencies, from very fast to very slow: the wavelengths form a geometric progression from 2π to 10000 · 2π.
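As a rough sketch, the encoding table can be built in a few lines of NumPy. The function name sinusoidal_encoding is hypothetical, the base of 10000 is the constant used in the original transformer paper, and placing sines on even indices and cosines on odd indices is one common layout, assumed here.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) table of fixed sinusoidal encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model width"
    positions = np.arange(num_positions)[:, None]          # shape (P, 1)
    # One frequency per sine/cosine pair, from fast (1.0) to slow (about 1/base).
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)  # shape (d_model / 2,)
    angles = positions * freqs                              # shape (P, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions hold sines
    pe[:, 1::2] = np.cos(angles)   # odd dimensions hold cosines
    return pe

pe = sinusoidal_encoding(50, 16)
print(pe.shape)
```

The early columns oscillate quickly from one position to the next, while the last columns barely change across the whole table.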
Why sinusoids
- The pattern is deterministic, needing no learned parameters.
- Different frequencies let the model read both fine and coarse position.
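- Because of the angle-addition identities, the encoding at position pos + k is a linear transform (a rotation of each sine/cosine pair) of the encoding at pos, and that transform depends only on the offset k, so the model can learn relative distances.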
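The last point can be verified numerically. The sketch below assumes the interleaved sine/cosine layout and base 10000 of the original paper; the helper names encode and offset_matrix are illustrative only. It builds one matrix for an offset of 3 and checks that it maps the encoding of any position to the encoding three steps later.

```python
import numpy as np

d_model, base = 16, 10000.0
freqs = base ** (-np.arange(0, d_model, 2) / d_model)  # one frequency per pair

def encode(pos):
    """Sinusoidal encoding of one position: interleaved [sin, cos] pairs."""
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def offset_matrix(k):
    """Block-diagonal matrix M_k such that encode(pos + k) == M_k @ encode(pos)."""
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(a + b) =  sin(a) cos(b) + cos(a) sin(b)
        # cos(a + b) = -sin(a) sin(b) + cos(a) cos(b)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

m3 = offset_matrix(3)  # depends only on the offset, not on the position
print(np.allclose(m3 @ encode(7), encode(10)))
print(np.allclose(m3 @ encode(40), encode(43)))
```

The same matrix works at positions 7 and 40, which is the sense in which a fixed offset corresponds to a single linear transform.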
Added, not concatenated
The encoding has the same dimensionality as the embedding, so the two can be summed element-wise instead of concatenated, and the model width stays unchanged. The model learns during training to read the positional component out of the combined vector.
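A quick shape comparison illustrates the difference. In this sketch the token embeddings are random stand-ins for learned ones, the sizes are arbitrary, and the encoding again assumes the interleaved layout with base 10000.

```python
import numpy as np

seq_len, d_model = 6, 8
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for learned embeddings

# Sinusoidal table with the same width as the embeddings.
positions = np.arange(seq_len)[:, None]
freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(positions * freqs)
pe[:, 1::2] = np.cos(positions * freqs)

summed = token_embeddings + pe                                  # width unchanged
concatenated = np.concatenate([token_embeddings, pe], axis=-1)  # width doubles

print(summed.shape)
print(concatenated.shape)
```

Summing keeps every later layer at width d_model, whereas concatenation would double the input width of everything downstream.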
Key idea
Attention is order blind, so sinusoidal encodings spanning many frequencies are added to the embeddings. This gives the model absolute position and an easy way to reason about relative distance, without any learned position parameters.