Positional Encodings: Sinusoidal

Attention is order blind

Pure self-attention treats a sequence as a set: shuffle the input tokens and each token's output is unchanged, only reordered to match. To model language, we must inject position information so the model knows the order.
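The order blindness can be checked on a toy example. The sketch below is an assumption-laden simplification, not a real transformer layer: it uses identity projections (queries, keys, and values are all the raw token vectors), a single head, and made-up 2-dimensional tokens. The helper names are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Toy self-attention with identity projections: Q = K = V = xs."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]   # made-up token vectors
perm = [2, 0, 1]                                  # an arbitrary shuffle
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row r of the shuffled output should equal row perm[r] of the original output
same = all(
    math.isclose(out_shuffled[r][j], out[perm[r]][j], abs_tol=1e-9)
    for r in range(len(perm))
    for j in range(len(tokens[0]))
)
print(same)
```

Each token gets the same output vector wherever it sits in the sequence, which is why nothing in the attention computation itself can tell the model which token came first.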

The sinusoidal trick

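The original Transformer adds a fixed sinusoidal signal to each token embedding. Each dimension is a sine or cosine wave of the position, and each sine/cosine pair of dimensions uses its own frequency, from very fast to very slow. For position pos and pair index i, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), so the wavelengths grow geometrically from 2π to 10000·2π.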
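As a minimal sketch, the encoding table can be built with the standard library alone. The function name and the pure-Python list layout are choices made for this illustration; the base of 10000 and the interleaved sine/cosine layout follow the original transformer paper, and the sketch assumes an even d_model.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            # Pair i uses frequency 1 / base^(2i / d_model): fast for small i, slow for large i
            angle = pos / base ** (2 * i / d_model)
            row.append(math.sin(angle))   # even dimension 2i
            row.append(math.cos(angle))   # odd dimension 2i + 1
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=50, d_model=8)
print(pe[0])   # at position 0 every sine is 0.0 and every cosine is 1.0
print(pe[1])   # early dimensions change quickly with position, later ones barely move
```

In practice the table would be computed once as an array and reused, since it depends only on the sequence length and d_model.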

Why sinusoids

The pattern is deterministic, needing no learned parameters.
Different frequencies let the model read both fine and coarse position.
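Because of the angle-addition identities, the encoding at position pos + k is a linear transform of the encoding at pos (a rotation of each sine/cosine pair by an angle that depends only on k), so the model can learn to attend by relative distance.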
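The relative-offset property in the last point can be verified numerically for a single sine/cosine pair. This is a sketch under assumptions: the frequency, the offset k = 5, and the helper names are picked only for illustration. The matrix comes from sin(a + b) = sin a cos b + cos a sin b and cos(a + b) = cos a cos b - sin a sin b.

```python
import math

def encode_pair(pos, freq):
    # One sine/cosine pair of the positional encoding
    return (math.sin(freq * pos), math.cos(freq * pos))

def offset_matrix(k, freq):
    # 2x2 matrix that maps the pair at pos to the pair at pos + k.
    # It depends on the offset k and the frequency, never on pos.
    c, s = math.cos(freq * k), math.sin(freq * k)
    return ((c, s), (-s, c))

def apply(matrix, vec):
    return tuple(row[0] * vec[0] + row[1] * vec[1] for row in matrix)

freq = 1.0 / 10000 ** (2 / 8)   # illustrative: the i = 1 pair when d_model = 8
k = 5                            # illustrative offset
shift = offset_matrix(k, freq)

# The same matrix works at every starting position
max_error = max(
    abs(a - b)
    for pos in range(100)
    for a, b in zip(apply(shift, encode_pair(pos, freq)), encode_pair(pos + k, freq))
)
print(max_error < 1e-12)
```

Because one fixed matrix per offset works at every position, a learned linear layer has a simple route to expressing "k tokens back" without ever seeing the absolute index.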

Added, not concatenated

The encoding is summed into the embedding, so it must have the same dimensionality as the embedding. The model learns during training to read the positional component from the combined vector.
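A small sketch of the summation step, assuming a hypothetical 3-token sequence with made-up 4-dimensional embeddings in which the first and last tokens are the same word. The helper name is an assumption for this illustration.

```python
import math

def position_row(pos, d_model, base=10000.0):
    # Sinusoidal encoding for a single position (assumes an even d_model)
    row = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        row += [math.sin(angle), math.cos(angle)]
    return row

d_model = 4
# Hypothetical token embeddings; rows 0 and 2 are the same token
embeddings = [
    [0.2, -0.1, 0.4, 0.0],
    [0.5, 0.3, -0.2, 0.1],
    [0.2, -0.1, 0.4, 0.0],
]

# Element-wise sum: the width stays d_model, unlike concatenation
inputs = [
    [e + p for e, p in zip(emb, position_row(pos, d_model))]
    for pos, emb in enumerate(embeddings)
]

print(len(inputs[0]))           # width of the combined vector
print(inputs[0] != inputs[2])   # does the repeated token now differ by position?
```

The repeated token ends up with two different input vectors, one per position, while every downstream layer still sees vectors of width d_model.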

Key idea

Attention is order blind, so sinusoidal encodings at many frequencies are added to the token embeddings. This gives the model absolute position and an easy way to reason about relative distance, without any learned position parameters.
