The Rotary Embeddings Deep

Position without adding vectors

Many models add a position vector to each token. Rotary position embeddings instead encode position by rotating the query and key vectors before the dot product, so position lives in the angle rather than in an added term.

How the rotation works

The query and key are each split into pairs of coordinates. Each pair is rotated by an angle proportional to the token position, and each pair uses a different frequency. Low-frequency pairs rotate slowly and capture long-range position; high-frequency pairs rotate fast and capture fine local distinctions.
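
The pairwise rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the base of 10000 is a commonly used default rather than something stated here, and pairing adjacent coordinates (0 with 1, 2 with 3, and so on) is an assumption, since some implementations pair the first half of the vector with the second half instead.

```python
import math

def rope_angles(position, dim, base=10000.0):
    """Angle for each coordinate pair: position * base**(-2i/dim).

    Pair 0 gets the highest frequency; later pairs rotate more slowly.
    The base value is an assumed default, not taken from the text.
    """
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate_pairs(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by its own angle."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i, theta in enumerate(rope_angles(position, dim, base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # Standard 2D rotation of the pair (x, y) by angle theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out

q = [1.0, 0.0, 1.0, 0.0]
rotated = rotate_pairs(q, position=3)
```

Because each pair is only rotated, the length of the vector is unchanged, and a token at position 0 is left exactly as it was.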

The relative property

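When you dot a rotated query at position m with a rotated key at position n, the two rotations combine into a single rotation by the difference in angles. The result therefore depends on position only through the difference m minus n, not on the absolute positions. So rotary embeddings give attention a built-in sense of relative distance, which can help the model generalize across positions.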

Rotation is applied to queries and keys, not values.
The dot product naturally encodes how far apart tokens are.
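
The relative property is easy to check numerically. The sketch below uses a hypothetical `rotate_pairs` helper, with an assumed base of 10000 and adjacent-coordinate pairing, and arbitrary example vectors. It compares attention scores for position pairs that share the same offset against a pair with a different offset.

```python
import math

def rotate_pairs(vec, position, base=10000.0):
    """Rotate adjacent coordinate pairs by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = position * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rotate_pairs(q, 5), rotate_pairs(k, 2))      # m - n = 3
score_b = dot(rotate_pairs(q, 105), rotate_pairs(k, 102))  # m - n = 3
score_c = dot(rotate_pairs(q, 5), rotate_pairs(k, 3))      # m - n = 2

print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error even though the absolute positions differ by 100, while the third differs because the offset changed.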

Why it scales

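Because position enters only as a rotation angle, you can push to longer contexts by adjusting the rotation frequencies so that positions beyond the training length still map to angles the model can handle. This idea underlies many context-extension tricks.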
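
As one illustration of adjusting the frequencies, the sketch below slows every rotation by a constant factor, which is equivalent to squeezing the longer range of positions back into the range seen during training. This is only one possible scheme, chosen here for simplicity. The `scale` parameter, the context lengths, and the base of 10000 are all hypothetical values for the example.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Angles per coordinate pair; scale > 1 slows every rotation by that factor."""
    return [(position / scale) * base ** (-2.0 * i / dim) for i in range(dim // 2)]

trained_len = 2048  # hypothetical context length used in training
target_len = 8192   # hypothetical longer context we want to reach
scale = target_len / trained_len

# Angles at the last position of each setting.
max_trained = rope_angles(trained_len - 1, dim=8)
max_unscaled = rope_angles(target_len - 1, dim=8)
max_scaled = rope_angles(target_len - 1, dim=8, scale=scale)

print(max_trained[0], max_unscaled[0], max_scaled[0])
```

Without scaling, the last position of the longer context produces angles well beyond anything reached within the training length. With scaling, it lands in roughly the same range. The trade-off is that neighbouring positions end up closer together in angle, which is why schemes like this are usually paired with some fine-tuning.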

Key idea

Rotary embeddings encode position by rotating query and key pairs at varying frequencies, so their dot product depends on relative distance rather than absolute position, giving clean relative encoding and a path to longer contexts.

Check yourself