Rotating queries and keys
Rotary position embeddings, or RoPE, encode position by rotating the query and key vectors by an angle that depends on their position. Pairs of dimensions are treated as points on a plane and spun by a position dependent angle.
Why rotation encodes relativity
When you take the dot product of a rotated query and a rotated key, the result depends only on the difference of their positions, not the absolute values. So attention scores naturally become a function of relative distance.
What makes it popular
- Position enters through multiplication on queries and keys, not addition to embeddings.
- It generalizes more gracefully to sequences longer than those seen in training.
- Different frequency bands rotate at different rates, capturing near and far relationships.
Applied inside attention
RoPE is applied right before the dot product in each attention head, so values are left untouched. This keeps the content of a token separate from how its position influences matching.
Key idea
Rotary embeddings rotate queries and keys by a position dependent angle so their dot product depends only on relative distance, giving a clean relative position scheme that extends well to longer sequences.