Sinusoidal Positional Encoding

The problem

Self-attention treats its inputs as an unordered set. If you shuffle the tokens, raw attention produces the same output vectors, just shuffled in the same way, so nothing in the result depends on where a token sat in the sequence. But language depends on order, so transformers add a positional encoding to each token embedding before the first attention layer.
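The order blindness can be checked directly. The sketch below is a deliberately stripped-down attention (no learned projections, queries, keys and values all equal to the inputs), which is an assumption made for brevity and not how a full transformer layer is built. The names self_attention, tokens and perm are illustrative only.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention with Q = K = V = tokens (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
perm = [2, 0, 1]                       # an arbitrary reordering
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token gets the same output vector no matter where it sits in the sequence.
same = all(
    abs(a - b) < 1e-9
    for i in range(len(perm))
    for a, b in zip(out_shuffled[i], out[perm[i]])
)
print(same)
```

Because the outputs only get reordered along with the inputs, nothing downstream can tell the two sequences apart unless position information is mixed into the embeddings first.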

The sinusoidal trick

The original transformer uses fixed sine and cosine waves of many different frequencies. Each position gets a unique vector, and each pair of dimensions (one sine, one cosine) oscillates at its own rate:

Low dimensions use high-frequency waves that change quickly from token to token
High dimensions use low-frequency waves that change slowly across the whole sequence
The combination gives every position a distinct fingerprint the model can read
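As a minimal sketch of the scheme just described, the function below builds the encoding table using the formulation from the original transformer paper: dimension 2i holds sin(pos / 10000^(2i/d_model)) and dimension 2i+1 holds the matching cosine. The function name, the toy embeddings and the choice of pure standard-library Python are assumptions for illustration; a real model would typically use an array library.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sines and cosines pair up")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            # Small i -> large angle step per token (fast wave);
            # large i -> tiny angle step per token (slow wave).
            angle = pos / (base ** (2 * i / d_model))
            row.append(math.sin(angle))   # dimension 2i
            row.append(math.cos(angle))   # dimension 2i + 1
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=50, d_model=8)

# Hypothetical toy embeddings: the encoding is simply added element-wise.
token_embeddings = [[0.1] * 8 for _ in range(50)]
with_position = [
    [e + p for e, p in zip(tok, pos_vec)]
    for tok, pos_vec in zip(token_embeddings, pe)
]
```

Position 0 comes out as alternating 0 and 1 (sin 0 and cos 0), and every later row differs from it, which is the "fingerprint" the list above refers to.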

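A neat property is that, for any fixed offset k, the encoding of position pos + k is a linear function of the encoding of position pos: each sine and cosine pair is simply rotated by an angle that depends only on k, not on pos. This lets the network learn to attend by relative offset, not just absolute index.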
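The linear relationship follows from the angle-addition identities, and it can be verified numerically. The sketch below assumes the same sine/cosine layout and base of 10000 as the original transformer paper; the helper names encode and shift are hypothetical. Note that shift never sees the position itself, only the offset k.

```python
import math

def encode(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position (sin, cos interleaved)."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def shift(vec, k, base=10000.0):
    """Apply the offset-k linear map: one 2x2 rotation per (sin, cos) pair."""
    d_model = len(vec)
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (base ** (2 * i / d_model))   # frequency of this pair
        s, c = vec[2 * i], vec[2 * i + 1]
        sk, ck = math.sin(omega * k), math.cos(omega * k)
        # sin(a + b) = sin a cos b + cos a sin b
        # cos(a + b) = cos a cos b - sin a sin b
        out += [s * ck + c * sk, c * ck - s * sk]
    return out

shifted = shift(encode(7, 8), k=5)    # rotate the encoding of position 7 by offset 5
target = encode(12, 8)                # direct encoding of position 12
max_error = max(abs(a - b) for a, b in zip(shifted, target))
print(max_error < 1e-9)
```

The same shift with k=5 maps position 100 to 105 just as well, which is why a learned linear layer could in principle pick out "five tokens back" regardless of where in the sequence it is looking.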

Alternatives

Modern models often swap this for learned position embeddings or rotary position embeddings (RoPE), but the goal is the same: give order back to an order-blind layer.

Key idea

Sinusoidal positional encoding adds fixed multi-frequency waves to token embeddings so a set-based attention layer can reason about sequence order.
