Machine Learning

Sinusoidal Positional Encoding

How transformers inject word order into a set-based attention layer.

The problem

Self-attention treats its inputs as an unordered set. If you shuffle the tokens, raw attention computes the same output vectors, just shuffled in the same way, so nothing in the result reflects the original order. But language depends on order, so transformers add a positional encoding to each token embedding before the first attention layer.
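The order blindness described above can be checked numerically. The following is a minimal NumPy sketch, not any particular library's implementation: a single attention head with random weights, where the names `self_attention`, `w_q`, `w_k`, and `w_v` and the sizes are illustrative assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information (illustrative)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs in the same way.
equivariant = np.allclose(out_shuffled, out[perm])
```

Each token ends up with the same output vector wherever it sits in the sequence, which is why order has to be supplied separately.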

The sinusoidal trick

The original transformer uses fixed sine and cosine waves of many different frequencies. Each position gets a unique vector, and each dimension oscillates at its own rate:

  • Low dimensions use high-frequency waves that change quickly from token to token
  • High dimensions use low-frequency waves that change slowly across the whole sequence
  • The combination gives every position a distinct fingerprint the model can read
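The list above can be made concrete with a short sketch. It follows the formulation in the original transformer paper, where even dimensions take a sine and odd dimensions a cosine of pos / 10000^(2i / d_model). The function name `sinusoidal_encoding`, the `base` parameter name, and the example sizes are assumptions for illustration, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) table of encodings; d_model assumed even."""
    positions = np.arange(num_positions)[:, None]         # shape (P, 1)
    pair_index = np.arange(d_model // 2)[None, :]         # shape (1, d_model / 2)
    angular_freq = base ** (-2.0 * pair_index / d_model)  # highest frequency first
    angles = positions * angular_freq                     # shape (P, d_model / 2)
    pe = np.empty((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_encoding(num_positions=50, d_model=16)

# Average step-to-step change in the first and in a late dimension.
fast_change = np.abs(np.diff(pe[:, 0])).mean()
slow_change = np.abs(np.diff(pe[:, -2])).mean()
```

Here `fast_change` comes out much larger than `slow_change`, matching the first two bullets. In use, row `pos` of the table is added to the embedding of the token at position `pos`.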

A neat property is that, for any fixed offset k, the encoding for position pos + k can be written as a linear function of the encoding for position pos, so the network can learn to attend by relative offset, not just absolute index.
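The linear relation between the encodings of two positions can be verified directly. Each sine and cosine pair behaves like a point on a circle, so moving k positions forward is a rotation of that pair by an angle that depends only on k. This sketch assumes the standard base of 10000 and a small hypothetical `d_model`; `encode` and `offset_matrix` are illustrative helper names.

```python
import numpy as np

d_model, base = 8, 10000.0  # hypothetical size, standard base
freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)

def encode(pos):
    """Sinusoidal encoding of one position, laid out as (sin, cos) pairs."""
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def offset_matrix(k):
    """Block-diagonal matrix that maps encode(pos) to encode(pos + k)."""
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for sin(w*(pos+k)) and cos(w*(pos+k)).
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

shift_by_3 = offset_matrix(3)  # built from the offset alone, not from pos
matches = [np.allclose(shift_by_3 @ encode(pos), encode(pos + 3))
           for pos in (0, 7, 42)]
```

The same matrix works for every starting position, which is what makes a relative offset expressible as a single linear map.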

Alternatives

Modern models often swap this for learned position embeddings or rotary encodings, but the goal is the same: give order back to an order-blind layer.

Key idea

Sinusoidal positional encoding adds fixed multi-frequency waves to token embeddings so a set-based attention layer can reason about sequence order.

Check yourself

1. Why does attention need positional encoding?

2. What varies across the dimensions of a sinusoidal encoding?