Attention is order-blind
The self-attention mechanism treats its inputs as a set: shuffling the input tokens simply shuffles the outputs in the same way. Without extra information, the model cannot tell the first token from the last, so order must be injected explicitly.
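A minimal NumPy sketch of this property, assuming a single unmasked attention head with random weights (the names `self_attention`, `w_q`, `w_k`, and `w_v` are illustrative, not taken from any particular library). It checks that permuting the input rows only permutes the output rows.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the inputs reorders the outputs identically; nothing else changes.
print(np.allclose(out[perm], out_perm))
```

Each token's output is the same wherever it sits in the sequence, which is why a separate positional signal is needed.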
Three families of approaches
- Absolute encodings add a position-dependent vector to each token embedding, either fixed sinusoids or learned per-position vectors.
- Relative encodings bias attention by the distance between two tokens rather than their absolute slots.
- Rotary encodings rotate the query and key vectors by an angle proportional to position, encoding relative offsets inside the attention dot product.
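The three families can be sketched side by side in NumPy. This is an illustration under stated assumptions, not a reference implementation: the function names, the frequency base of 10000, and the linear distance penalty with `slope=0.5` are choices made for the example (learned per-offset biases are another common relative scheme), and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model, base=10000.0):
    """Absolute: one fixed vector per position, to be added to the embeddings."""
    pos = np.arange(seq_len)[:, None]
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)
    angles = pos * freqs
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def relative_bias(seq_len, slope=0.5):
    """Relative: a score bias that depends only on the distance |i - j|."""
    idx = np.arange(seq_len)
    return -slope * np.abs(idx[:, None] - idx[None, :])

def rotate(x, positions, base=10000.0):
    """Rotary: rotate each feature pair by an angle proportional to position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8  # toy sizes
tokens = rng.normal(size=(seq_len, d_model))

with_abs = tokens + sinusoidal_positions(seq_len, d_model)  # absolute: added to inputs
bias = relative_bias(seq_len)                               # relative: added to scores

# Rotary: the same offset (3) at two different absolute locations
# yields the same query-key score.
q = rng.normal(size=(1, d_model))
k = rng.normal(size=(1, d_model))
s_near = rotate(q, np.array([5])) @ rotate(k, np.array([2])).T
s_far = rotate(q, np.array([105])) @ rotate(k, np.array([102])).T
print(np.allclose(s_near, s_far))
```

Note where each scheme acts: the absolute vector changes the token representation itself, while the relative bias and the rotary rotation act only inside the attention score.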
Why modern models prefer relative and rotary
Learned absolute positions struggle to generalize beyond the lengths seen in training, because slots past that length have no trained vector. Relative and rotary schemes depend on offsets between tokens, which extrapolate more gracefully and underpin many context-extension techniques.
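A toy illustration of the difference, assuming a learned table with one row per training position and a relative scheme that clips offsets to a maximum distance (clipping is one possible design, assumed here for the example). The names `learned_table`, `absolute_lookup`, and `clipped_offsets` are hypothetical. The sketch shows only that offsets remain well defined at longer lengths; how well a real model extrapolates is an empirical question.

```python
import numpy as np

train_len, d_model = 8, 4  # toy sizes
rng = np.random.default_rng(0)

# Learned absolute positions: one row per slot seen during training.
# Random values stand in for trained parameters.
learned_table = rng.normal(size=(train_len, d_model))

def absolute_lookup(positions):
    """Fetch position vectors; fails for any position >= train_len."""
    return learned_table[positions]

def clipped_offsets(seq_len, max_distance=train_len - 1):
    """Pairwise offsets j - i, clipped to the range covered in training."""
    idx = np.arange(seq_len)
    return np.clip(idx[None, :] - idx[:, None], -max_distance, max_distance)

longer = 12  # longer than anything seen in training

try:
    absolute_lookup(np.arange(longer))
    absolute_ok = True
except IndexError:
    absolute_ok = False  # the table has no rows for positions 8..11

offsets = clipped_offsets(longer)  # defined for every pair at any length
print(absolute_ok, offsets.shape)
```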
Connection to tokens
Position is assigned per token, so tokenization decides how many position slots a given text consumes. A high-fertility tokenization, one that produces many tokens per word, uses more positions for the same meaning and hits length limits sooner.
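To make this concrete, the sketch below compares two toy tokenizers on the same text. Both `word_tokens` and `char_tokens` are deliberately simplistic stand-ins rather than real subword tokenizers, and `max_positions` is a hypothetical length limit.

```python
text = "positional encodings generalize unevenly"

def word_tokens(s):
    """Toy low-fertility split: one token per whitespace-separated word."""
    return s.split()

def char_tokens(s):
    """Toy high-fertility split: one token per non-space character."""
    return [c for c in s if not c.isspace()]

def fertility(tokens, s):
    """Average number of tokens produced per word."""
    return len(tokens) / len(s.split())

max_positions = 16  # hypothetical context limit

for name, tokenize in [("word", word_tokens), ("char", char_tokens)]:
    tokens = tokenize(text)
    positions = list(range(len(tokens)))  # one position slot per token
    fits = len(positions) <= max_positions
    print(name, len(positions), fertility(tokens, text), fits)
```

The text is identical in both cases, but the character-level split consumes many more position slots and no longer fits within the assumed limit.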
Key idea
Attention ignores order, so models add positional information through absolute, relative, or rotary schemes, with relative and rotary generalizing better to longer inputs.