The missing order
Self-attention treats its input as an unordered bag of tokens. By itself, it has no sense of which token comes first. Yet word order carries meaning: "the dog bit the man" differs from "the man bit the dog".
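To make the problem concrete, here is a minimal sketch of self-attention without any position information. It assumes a single head with no learned projections (queries, keys, and values are all the raw embeddings), and the three-dimensional token vectors are made up for illustration. Under those assumptions, reordering the sentence only reorders the outputs: each word gets the same vector either way.

```python
import math

def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x).

    Simplifying assumption: a real transformer layer applies learned
    projection matrices first, but the order-blindness is the same.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Hypothetical toy embeddings, chosen only for illustration.
tokens = {"dog": [1.0, 0.0, 0.5], "bit": [0.0, 1.0, 0.2], "man": [0.3, 0.7, 1.0]}

sent_a = ["dog", "bit", "man"]
sent_b = ["man", "bit", "dog"]
out_a = dict(zip(sent_a, self_attention([tokens[w] for w in sent_a])))
out_b = dict(zip(sent_b, self_attention([tokens[w] for w in sent_b])))

# Each word receives the same output vector in both word orders.
same_outputs = all(
    math.isclose(a, b, abs_tol=1e-9)
    for w in tokens
    for a, b in zip(out_a[w], out_b[w])
)
print(same_outputs)
```

The check compares outputs word by word rather than position by position, which is exactly the sense in which attention alone cannot tell the two sentences apart.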
Adding position information
Positional encoding injects order by adding a position-dependent vector to each token embedding.
- The original transformer used fixed sinusoidal patterns of different frequencies
- Each position gets a unique signature the model can read
- Newer models often learn position vectors directly or use rotary schemes that rotate query and key vectors
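The fixed sinusoidal scheme from the first bullet can be sketched as follows. This is an illustrative version, not a reference implementation: the function names are hypothetical, the base of 10000 follows the original transformer paper, and the layout interleaves sine and cosine values along the embedding dimension.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Return the position vector for one position (hypothetical helper).

    Even dimensions hold sines and odd dimensions hold cosines, with
    wavelengths that grow geometrically along the embedding dimension.
    """
    vec = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]  # trim the extra value if d_model is odd

def add_position(embedding, position):
    """Add the position vector to a token embedding, element by element."""
    pe = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# The same (made-up) token embedding placed at two different positions
# yields two different input vectors.
embedding = [0.5, -0.2, 0.1, 0.9]
at_start = add_position(embedding, 0)
at_third = add_position(embedding, 2)
print(at_start)
print(at_third)
```

Because the vector depends only on the position index, it needs no training and can be computed for any position, including ones not seen during training.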
How the model uses it
Because positions are added to embeddings, attention scores can depend on both content and location. The model can learn rules like "attend to the previous token" or "focus on the start of the sentence". Relative schemes, which encode the distance between tokens rather than their absolute positions, are popular because they can help models handle sequences longer than those seen in training.
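As an illustration of a relative scheme, the rotary approach mentioned earlier can be sketched in a few lines. The sketch assumes an even vector length, pairs up adjacent dimensions (one common convention; implementations differ), and uses made-up query and key vectors with hypothetical function names. It shows only that the score depends on the distance between the two positions; it does not demonstrate how well a trained model handles longer sequences.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle.

    Assumes len(vec) is even. Each pair rotates at its own frequency.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query with a rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Made-up query and key vectors for illustration.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

near = score(q, k, 5, 3)       # key sits two positions before the query
far = score(q, k, 105, 103)    # same offset, much later in the sequence
other = score(q, k, 5, 4)      # different offset

print(math.isclose(near, far, abs_tol=1e-9))
print(math.isclose(near, other, abs_tol=1e-6))
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, while changing the offset between them changes it. That is what lets a rule like "attend to the previous token" apply the same way anywhere in the sequence.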
Key idea
Positional encoding adds order information to token embeddings so attention can use both what a token is and where it sits.