The problem
Self-attention treats its inputs as an unordered set. If you shuffle the tokens, raw attention produces the same output vectors, just shuffled the same way, so nothing in the result depends on where a token sat. But language depends on order, so transformers add a positional encoding to each token embedding before the first attention layer.
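A minimal sketch of that order blindness, in plain Python so it runs without dependencies. It assumes a single attention head with identity projections (queries, keys, and values are all the raw token vectors), which is a simplification; the function name and the toy token values are made up for illustration.

```python
import math

def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        # Softmax, shifted by the max score for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        out.append([sum(w / total * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]   # toy embeddings, no position info
perm = [2, 0, 1]                                # an arbitrary reordering
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token gets the same output vector wherever it sits in the sequence.
same = all(
    abs(out_shuffled[j][c] - out[perm[j]][c]) < 1e-9
    for j in range(len(perm))
    for c in range(2)
)
print(same)
```

The comparison uses a small tolerance because shuffling changes the order of the floating-point sums.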
The sinusoidal trick
The original transformer uses fixed sine and cosine waves of many different frequencies. Each position gets a unique vector, and each dimension oscillates at its own rate:
- Low dimensions use high-frequency waves that change quickly from token to token
- High dimensions use low-frequency waves that change slowly across the whole sequence
- The combination gives every position a distinct fingerprint the model can read
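The scheme described above can be sketched as follows. This assumes the layout from the original transformer paper: sines at even indices, cosines at odd indices, and frequencies that fall geometrically with a base of 10000. The function and parameter names are illustrative, not taken from any library.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Return the d_model-dimensional encoding for one position.

    Even indices hold sines and odd indices hold cosines. Each (sin, cos)
    pair shares one frequency, and the frequency drops as the index grows.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    vec = []
    for i in range(d_model // 2):
        freq = base ** (-2 * i / d_model)   # 1.0 at i = 0, falling toward 1 / base
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec

# Encodings for the first 50 positions of a toy 8-dimensional model.
table = [sinusoidal_encoding(pos, 8) for pos in range(50)]
```

In a model, row `pos` of such a table would be added element-wise to the embedding of the token at position `pos`.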
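A neat property is that, for any fixed offset k, the encoding of position p + k is a linear function of the encoding of position p: each sine and cosine pair is simply rotated by an angle that depends on k but not on p. So the network can learn to attend by relative offset, not just absolute index.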
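The linear relationship comes from the angle-addition identities for sine and cosine. Here is a small numerical check, assuming the same even/odd sine and cosine layout and base of 10000 as the original paper; `encode` and `shift` are hypothetical helper names.

```python
import math

def encode(position, d_model, base=10000.0):
    """Sinusoidal encoding: sin at even indices, cos at odd indices."""
    out = []
    for i in range(d_model // 2):
        freq = base ** (-2 * i / d_model)
        out += [math.sin(position * freq), math.cos(position * freq)]
    return out

def shift(encoding, offset, base=10000.0):
    """Map the encoding of some position p to that of p + offset.

    Only the offset is used; p itself is never needed. Each (sin, cos)
    pair is rotated by the angle offset * freq, a linear operation.
    """
    d_model = len(encoding)
    out = []
    for i in range(d_model // 2):
        freq = base ** (-2 * i / d_model)
        s, c = encoding[2 * i], encoding[2 * i + 1]
        cos_k, sin_k = math.cos(offset * freq), math.sin(offset * freq)
        out.append(s * cos_k + c * sin_k)   # sin((p + k) * freq)
        out.append(c * cos_k - s * sin_k)   # cos((p + k) * freq)
    return out

shifted = shift(encode(7, 8), 5)    # start at position 7, move 5 steps
direct = encode(12, 8)              # position 12 computed from scratch
max_error = max(abs(a - b) for a, b in zip(shifted, direct))
```

Up to floating-point rounding, `shifted` and `direct` agree, and the same `shift` works from any starting position.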
Alternatives
Modern models often swap this for learned position embeddings or rotary position embeddings (RoPE), but the goal is the same: give order back to an order-blind layer.
Key idea
Sinusoidal positional encoding adds fixed multi-frequency waves to token embeddings so a set-based attention layer can reason about sequence order.