Machine Learning

Sinusoidal Positional Encodings

How attention learns order when it has none built in.

Attention is order blind

Pure attention treats a sequence as a set: shuffle the input tokens and each token's output stays the same, only the order of the outputs is shuffled with it. To model language, we must inject position information so the model knows where each token sits.
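A minimal sketch of this order blindness, assuming NumPy and a stripped-down single-head self-attention with no learned projections (the helper `self_attention` and the sizes are illustrative, not part of the lesson). Shuffling the input rows only shuffles the output rows in the same way:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (illustrative only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 tokens, 8-dimensional embeddings (arbitrary sizes)
perm = rng.permutation(5)     # a random reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Each token gets the same output vector regardless of where it sits.
print(np.allclose(out_shuffled, out[perm]))  # True
```

Nothing in the computation refers to a token's index, which is why position has to be supplied from outside.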

The sinusoidal trick

The original transformer adds a fixed sinusoidal signal to each token embedding. Each dimension is a sine or cosine wave, and different dimensions use different frequencies, from very fast to very slow.
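One way to build such a signal, sketched in NumPy. The function name `sinusoidal_encoding` is hypothetical; the base of 10000 and the sine/cosine interleaving follow the formula in the original transformer paper, and `d_model` is assumed to be even:

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) array of sinusoidal encodings.

    Assumes d_model is even. Even columns hold sines, odd columns hold
    cosines, and each sine/cosine pair shares one frequency.
    """
    positions = np.arange(num_positions)[:, None]   # shape (P, 1)
    pair_index = np.arange(d_model // 2)[None, :]   # shape (1, d_model / 2)
    freqs = 1.0 / base ** (2 * pair_index / d_model)  # fast to slow
    angles = positions * freqs                      # shape (P, d_model / 2)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(num_positions=50, d_model=16)
print(pe.shape)  # (50, 16)
```

The first pair of columns completes a cycle roughly every 6 positions, while the last pair changes only slightly across the whole sequence.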

Why sinusoids

  • The pattern is deterministic, needing no learned parameters.
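  • Different frequencies let the model read both fine and coarse position: fast waves separate neighboring tokens, while slow waves separate tokens that are far apart.
  • Because of the angle-addition identities, the encoding at position p + k is a linear transform (a rotation of each sine/cosine pair) of the encoding at position p, and that transform depends only on the offset k. This makes relative distances easy for the model to learn.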
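The relative-offset property in the last bullet can be checked numerically. This is a sketch under the same assumptions as the original transformer formula (base 10000, even `d_model`); the helpers `sinusoidal_encoding` and `offset_matrix` are hypothetical names introduced for illustration:

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Sinusoidal encodings plus the per-pair frequencies (d_model assumed even)."""
    positions = np.arange(num_positions)[:, None]
    freqs = 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)
    angles = positions * freqs
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe, freqs

def offset_matrix(k, freqs):
    """Block-diagonal matrix that maps the encoding at p to the one at p + k."""
    d_model = 2 * len(freqs)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(w(p+k)) =  cos(wk) sin(wp) + sin(wk) cos(wp)
        # cos(w(p+k)) = -sin(wk) sin(wp) + cos(wk) cos(wp)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

pe, freqs = sinusoidal_encoding(100, 16)
k = 7
shift = offset_matrix(k, freqs)

# One matrix, built from k alone, shifts every position by k.
print(np.allclose(pe[:-k] @ shift.T, pe[k:]))  # True
```

The matrix never looks at the absolute position, only at the offset, so a single linear map suffices to express "k tokens later" anywhere in the sequence.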

Added, not concatenated

The encoding has the same dimensionality as the token embedding and is summed with it element-wise, so the vector width does not grow. The model learns to read the positional component from the combined vector during training.
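A small illustration of the difference, assuming NumPy. The sizes, token ids, and the random `embedding_table` are made up and stand in for a learned embedding layer:

```python
import numpy as np

vocab_size, d_model, seq_len = 100, 16, 6   # hypothetical sizes

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # stand-in for learned embeddings
token_ids = np.array([5, 17, 5, 42, 8, 17])               # token 5 appears twice

# Sinusoidal encodings for positions 0..seq_len-1 (base 10000, as in the original paper).
positions = np.arange(seq_len)[:, None]
freqs = 1.0 / 10000.0 ** (2 * np.arange(d_model // 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(positions * freqs)
pe[:, 1::2] = np.cos(positions * freqs)

token_vectors = embedding_table[token_ids]
model_input = token_vectors + pe                              # summed: width unchanged
concatenated = np.concatenate([token_vectors, pe], axis=-1)   # the alternative: width doubles

print(model_input.shape)    # (6, 16)
print(concatenated.shape)   # (6, 32)

# The same token at positions 0 and 2 now yields different input vectors.
print(np.allclose(token_vectors[0], token_vectors[2]))  # True
print(np.allclose(model_input[0], model_input[2]))      # False
```

Summing keeps every downstream layer at width `d_model`, at the cost of mixing token and position information in the same coordinates.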

Key idea

Attention is order blind, so sinusoidal encodings of many frequencies are added to embeddings, giving the model absolute position and an easy way to reason about relative distance without any learned position parameters.

Check yourself

1. Why does a transformer need positional encodings at all?

2. What advantage do multiple sinusoid frequencies provide?

3. How is the sinusoidal encoding combined with the token embedding?