Positional Encoding

The missing order

Self-attention treats its input as a bag of tokens. By itself, it has no sense of which token comes first. Yet word order carries meaning: "the dog bit the man" differs from "the man bit the dog".
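The "bag of tokens" claim can be checked with a small sketch. It assumes a single attention head with identity query, key, and value projections, which is a simplification and not how a trained model is set up; the embeddings are made-up numbers and `self_attention` is a hypothetical helper name.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product score of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of the token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up embeddings
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])

# Compare each token's output in the original order with its output
# in the reversed order (un-reverse the second result to line them up).
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_f, row_b in zip(forward, backward[::-1])
    for a, b in zip(row_f, row_b)
)
```

Here `same` compares each token's output across the two orderings. Without position information, reordering the input only reorders the output, so the layer cannot tell the two sentences apart.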

Adding position information

Positional encoding injects order by adding a position-dependent vector to each token embedding.

The original transformer used fixed sinusoidal patterns of different frequencies.
Each position gets a unique signature the model can read.
Newer models often learn position vectors directly or use rotary schemes, which rotate query and key vectors by a position-dependent angle instead of adding a vector.
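The fixed sinusoidal scheme can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the even-`d_model` restriction, and the example embeddings are assumptions made here, while the base of 10000 follows the original transformer formulation.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # Each sine/cosine pair shares one frequency;
            # the frequency falls as the dimension index i grows.
            angle = pos / (base ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings, table):
    """Add the position vector for index p to the embedding at index p."""
    return [
        [e + p for e, p in zip(emb, pos_vec)]
        for emb, pos_vec in zip(embeddings, table)
    ]

table = sinusoidal_encoding(num_positions=4, d_model=8)
embeddings = [[0.1] * 8 for _ in range(4)]  # made-up token embeddings
encoded = add_positions(embeddings, table)
```

Position 0 maps to alternating zeros and ones (sin 0 and cos 0), and every later row differs, which is what gives each position its own signature.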

How the model uses it

Because position information is combined with the embeddings, attention scores can depend on both content and location. The model can learn rules like "attend to the previous token" or "focus on the start of the sentence". Relative schemes, which encode the distance between tokens instead of their absolute positions, are popular because they can help models handle sequences longer than those seen in training.
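The relative behaviour of rotary schemes can be illustrated with a toy two-dimensional case. This is a sketch under stated assumptions: real rotary implementations rotate many pairs of dimensions at different frequencies, whereas here there is a single pair, and the angle step `theta=0.5` and the query and key values are arbitrary choices for illustration.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by position * theta radians."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]

q, k = (1.0, 2.0), (0.5, -1.0)  # made-up query and key vectors

near = score(q, k, 3, 1)       # positions 3 and 1, offset 2
far = score(q, k, 103, 101)    # positions 103 and 101, same offset
other = score(q, k, 3, 2)      # offset 1
```

`near` and `far` agree up to floating-point error because the score depends only on the offset between the two positions, while `other` differs because the offset changed. This offset-only dependence is one reason such schemes are thought to transfer more gracefully to longer inputs.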

Key idea

Positional encoding adds order information to token embeddings so attention can use both what a token is and where it sits.
