Machine Learning

The Rotary Embeddings Deep

Encoding position by rotating query and key vectors at different speeds.

6 min read · advanced

Position without adding vectors

Many models add a position vector to each token embedding. Rotary position embeddings instead encode position by rotating the query and key vectors before the dot product, so position lives in the angle between them rather than in an added term.

How the rotation works

The query and key are each split into pairs of coordinates. Each pair is rotated by an angle proportional to the token position, and each pair uses a different frequency. Low-frequency pairs rotate slowly and capture long-range position, while high-frequency pairs rotate quickly and capture fine local distinctions.
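The pairwise rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `pair_frequencies` and `rotate` are made up for this lesson, the base of 10000 is a commonly used value assumed here, and pairing adjacent coordinates is one convention among several (some implementations pair the first half of the vector with the second half instead).

```python
import math

def pair_frequencies(dim, base=10000.0):
    """One rotation frequency per coordinate pair, from fast (pair 0) to slow.

    base=10000 is an assumed, commonly used default.
    """
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate(vec, position, base=10000.0):
    """Rotate each adjacent coordinate pair of vec by position * frequency."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i, freq in enumerate(pair_frequencies(len(vec), base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2-D rotation of the pair (x, y) by `angle`.
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because every pair is only rotated, the length of the vector is unchanged; position 0 leaves the vector as it was.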

The relative property

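When you take the dot product of a rotated query at position m with a rotated key at position n, the result depends only on the difference m - n, not on the absolute positions: rotating both vectors by the same extra angle leaves the angle between them unchanged. So rotary embeddings give attention a built-in sense of relative distance, which helps the model generalize across positions.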

  • Rotation is applied to queries and keys, not values.
  • The dot product naturally encodes how far apart tokens are.
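A quick numerical check of the relative property, as a sketch under the same assumptions as before (adjacent coordinate pairs, base 10000). The helper names `rotate` and `score` and the example vectors are illustrative only. Note that only the query and key are rotated; values are not involved.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate adjacent coordinate pairs; pair i uses frequency base**(-2i/dim)."""
    out = []
    for i in range(len(vec) // 2):
        angle = position * base ** (-2.0 * i / len(vec))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, m, n):
    """Dot product of a query rotated to position m and a key rotated to position n."""
    return sum(a * b for a, b in zip(rotate(q, m), rotate(k, n)))

# Arbitrary example vectors, chosen only for illustration.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset m - n = 3 at two very different absolute positions.
near = score(q, k, 5, 2)
far = score(q, k, 105, 102)

# A different offset (m - n = 2) for comparison.
other = score(q, k, 5, 3)
```

Up to floating-point rounding, `near` and `far` agree, while `other` differs because the offset changed.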

Why it scales

Because position enters only as a rotation angle, you can push to longer contexts by adjusting the frequencies, for example by rescaling positions or changing the frequency base. This idea underlies many context extension tricks.
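One such adjustment is to shrink every position by a constant factor so that a longer sequence maps onto the angle range seen in training. The sketch below illustrates that idea only; the function name `pair_angles`, the `scale` parameter, and the context lengths are hypothetical and not taken from any particular model.

```python
import math

def pair_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angle of each coordinate pair; scale < 1 slows every rotation."""
    return [scale * position * base ** (-2.0 * i / dim) for i in range(dim // 2)]

train_len, target_len = 2048, 8192   # hypothetical context lengths
scale = train_len / target_len       # 0.25

# With scaling, the last position of the longer context gets the angles of
# an effective position of 2047.75, which is inside the trained range.
stretched = pair_angles(target_len - 1, dim=8, scale=scale)

# Without scaling, the same position rotates four times further.
unscaled = pair_angles(target_len - 1, dim=8)
```

The trade-off is that neighbouring tokens now differ by a smaller angle, so models are usually fine-tuned briefly after such a change.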

Key idea

Rotary embeddings encode position by rotating query and key pairs at varying frequencies, so their dot product depends on relative distance rather than absolute position, giving clean relative encoding and a path to longer contexts.

Check yourself

1. What property does the rotary dot product have?

2. To which vectors is the rotation applied?