Position without adding vectors
Many models add a position vector to each token embedding. Rotary position embeddings instead encode position by rotating the query and key vectors before the dot product, so position lives in the angle rather than in an added term.
How the rotation works
The query and key are each split into pairs of coordinates. Each pair is rotated by an angle proportional to the token position, and each pair uses a different frequency. Low-frequency pairs rotate slowly and capture long-range position; high-frequency pairs rotate quickly and capture fine local distinctions.
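The pairwise rotation described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the function name `rope_rotate`, the choice of adjacent coordinates as pairs, and the geometric frequency schedule with `base=10000.0` are all assumptions made for the example. Real implementations vectorize this over whole tensors.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of `vec` by position-dependent angles.

    Assumed schedule: the pair starting at index i uses frequency
    base ** (-i / d), so early pairs rotate fast and later pairs rotate slowly.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y) by `angle`.
        out.extend([x * c - y * s, x * s + y * c])
    return out

q = [1.0, 0.0, 1.0, 0.0]
for pos in (0, 1, 100):
    print(pos, [round(v, 3) for v in rope_rotate(q, pos)])
```

At position 0 every angle is zero, so the vector is returned unchanged. At other positions the vector's length is preserved, because each pair only rotates.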
The relative property
When you take the dot product of a rotated query at position m and a rotated key at position n, the result depends only on the difference m - n, not on the absolute positions. Rotary embeddings therefore give attention a built-in sense of relative distance, which helps generalization.
- Rotation is applied to queries and keys, not values.
- The dot product naturally encodes how far apart tokens are.
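The relative property can be checked numerically. The sketch below is self-contained and rests on assumptions: the helper names `rotate_pairs` and `score`, the frequency schedule with `base=10000.0`, and the sample vectors are all made up for illustration.

```python
import math

def rotate_pairs(vec, position, base=10000.0):
    """Rotate adjacent coordinate pairs; pair at index i uses frequency base ** (-i / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out.extend([vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c])
    return out

def score(q, k, m, n):
    """Dot product of q rotated to position m with k rotated to position n."""
    return sum(a * b for a, b in zip(rotate_pairs(q, m), rotate_pairs(k, n)))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset m - n = 3 at two different absolute positions.
print(score(q, k, 5, 2), score(q, k, 105, 102))
# A different offset (m - n = 1) generally gives a different score.
print(score(q, k, 5, 4))
```

The first two scores agree up to floating-point error, because rotating both vectors by a common extra angle leaves their dot product unchanged. Only the offset between the two positions survives.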
Why it scales
Because position enters only through the rotation angles, you can target longer contexts by adjusting the frequencies that set those angles. Many context-extension techniques build on this idea.
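One simple way to adjust the frequencies is to divide all of them by a constant, which is equivalent to compressing positions by that factor (often called linear position interpolation). The sketch below illustrates only that one option. The function name `rope_angles`, the `scale` parameter, and the lengths 2048 and 8192 are hypothetical values chosen for the example, not settings from any particular model.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angle for each coordinate pair at `position`.

    `scale` is a hypothetical knob: every frequency is divided by it,
    which is the same as compressing positions by that factor.
    """
    return [position * base ** (-i / dim) / scale for i in range(0, dim, 2)]

trained_len, target_len, dim = 2048, 8192, 8   # illustrative values
scale = target_len / trained_len

original = rope_angles(trained_len, dim)
extended = rope_angles(target_len, dim, scale=scale)

# With scaling, the last position of the longer context produces the same
# angles as the last position of the original context.
print(all(math.isclose(a, b) for a, b in zip(original, extended)))
```

The trade-off in this sketch is resolution: neighbouring positions end up closer together in angle, so models typically need some fine-tuning after such a change. Other schemes adjust the frequencies non-uniformly instead.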
Key idea
Rotary embeddings encode position by rotating query and key coordinate pairs at varying frequencies. Their dot product then depends on relative distance rather than absolute position, which gives a clean relative encoding and a path to longer contexts.