Rotary Position Embeddings

Rotating queries and keys

Rotary position embeddings, or RoPE, encode position by rotating the query and key vectors by an angle that depends on their position. Pairs of dimensions are treated as points on a plane and spun by a position dependent angle.

Why rotation encodes relativity

When you take the dot product of a rotated query and a rotated key, the result depends only on the difference of their positions, not the absolute values. So attention scores naturally become a function of relative distance.

What makes it popular

Position enters through multiplication on queries and keys, not addition to embeddings.
It generalizes more gracefully to sequences longer than those seen in training.
Different frequency bands rotate at different rates, capturing near and far relationships.

Applied inside attention

RoPE is applied right before the dot product in each attention head, so values are left untouched. This keeps the content of a token separate from how its position influences matching.

Key idea

Rotary embeddings rotate queries and keys by a position dependent angle so their dot product depends only on relative distance, giving a clean relative position scheme that extends well to longer sequences.

Rotary Position Embeddings

Rotating queries and keys

Why rotation encodes relativity

What makes it popular

Applied inside attention

Key idea

Check yourself