Positional Information

Attention is order-blind

Self-attention is permutation-equivariant: it treats its inputs as a set, so reordering the tokens merely reorders the outputs. Without extra information, the model cannot tell the first token from the last, so order must be injected explicitly.
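
The order blindness can be checked numerically. The following is a minimal sketch, assuming a single attention head with identity projections (Q = K = V = X) and no mask; the function names and toy embeddings are illustrative, not taken from any particular library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with identity projections: Q = K = V = X (an assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three toy token embeddings
perm = [2, 0, 1]                           # a reordering of the tokens
X_perm = [X[i] for i in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Row i of the permuted output should equal row perm[i] of the original output.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(Y_perm[i], Y[p])
)
print("outputs only reordered:", max_diff < 1e-12)
```

Each token receives the same output vector regardless of where it sits in the sequence, which is why position has to be supplied separately.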

Three families of approaches

Absolute encodings add a position-dependent vector to each token embedding, either fixed sinusoids or learned per-position vectors.
Relative encodings bias attention scores by the distance between two tokens rather than by their absolute slots.
Rotary encodings rotate the query and key vectors by an angle proportional to position, so the attention dot product depends on position only through the relative offset.
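
To make the absolute and rotary families concrete, here is a rough sketch in plain Python. It assumes the commonly used frequency base of 10000 and an interleaved (even, odd) pairing of dimensions; real implementations differ in layout, and the helper names `sinusoidal`, `rotate`, and `dot` are hypothetical.

```python
import math

def sinusoidal(pos, d_model, base=10000.0):
    """Fixed absolute encoding: sin/cos pairs at geometrically spaced frequencies."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def rotate(vec, pos, base=10000.0):
    """Rotary encoding: rotate each (even, odd) pair by pos times its frequency."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (arbitrary values, for illustration only).
q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

score_a = dot(rotate(q, 3), rotate(k, 7))      # positions 3 and 7, offset 4
score_b = dot(rotate(q, 103), rotate(k, 107))  # positions 103 and 107, offset 4
score_c = dot(rotate(q, 3), rotate(k, 8))      # positions 3 and 8, offset 5

print("same offset, same score:", abs(score_a - score_b) < 1e-9)
print("different offset, different score:", abs(score_a - score_c) > 1e-3)
```

Shifting both positions by 100 leaves the rotary score unchanged, while changing the offset changes it. An absolute encoding such as `sinusoidal(pos, d_model)` would instead be added to the token embedding before attention.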

Why modern models prefer relative and rotary

Learned absolute positions struggle to generalize beyond the lengths seen in training, because slots past the training length have no trained vector. Relative and rotary schemes depend on offsets, which tend to extrapolate more gracefully and underpin many context-extension tricks.
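
The difference can be sketched with two lookup tables. This is a toy illustration under stated assumptions: the constants `MAX_TRAIN_LEN` and `MAX_DIST` are hypothetical, the table values are placeholders rather than trained parameters, and clipping distant offsets is only one possible design for a relative bias.

```python
MAX_TRAIN_LEN = 8  # hypothetical longest sequence seen in training
MAX_DIST = 4       # hypothetical clipping distance for relative offsets

# Learned absolute table: one vector per position slot seen in training.
# (Placeholder values standing in for trained parameters.)
absolute_table = [[0.01 * p, -0.01 * p] for p in range(MAX_TRAIN_LEN)]

def absolute_lookup(pos):
    """Fails for pos >= MAX_TRAIN_LEN: no vector was ever learned for that slot."""
    return absolute_table[pos]

# Relative bias: one scalar per offset in [-MAX_DIST, MAX_DIST].
relative_bias = {off: -0.1 * abs(off) for off in range(-MAX_DIST, MAX_DIST + 1)}

def relative_lookup(i, j):
    """Bias for query position i attending to key position j, with clipped offset."""
    off = max(-MAX_DIST, min(MAX_DIST, j - i))
    return relative_bias[off]

try:
    absolute_lookup(20)
    absolute_ok = True
except IndexError:
    absolute_ok = False

near_bias = relative_lookup(2, 5)    # offset 3, positions inside the training range
far_bias = relative_lookup(20, 23)   # offset 3, positions beyond the training range

print("absolute lookup at position 20 works:", absolute_ok)
print("relative bias identical at both places:", near_bias == far_bias)
```

The absolute table has nothing to return at position 20, whereas the relative lookup only ever sees the offset, which was already covered during training.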

Connection to tokens

Position is assigned per token, so tokenization decides how many position slots a given text consumes. A high-fertility split, one that produces many tokens per word, uses more positions for the same meaning and hits length limits sooner.
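
A small sketch of the effect, assuming two toy tokenizers (whitespace words and single characters) in place of a real subword vocabulary; the sample text, the function names, and the `max_positions` limit are all hypothetical.

```python
def word_tokens(text):
    """Low-fertility toy tokenizer: one token per whitespace-separated word."""
    return text.split()

def char_tokens(text):
    """High-fertility toy tokenizer: one token per character, spaces included."""
    return list(text)

def fertility(text, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    return len(tokenize(text)) / len(text.split())

def fits(tokens, max_positions):
    """True if the sequence needs no more position slots than the model offers."""
    return len(tokens) <= max_positions

text = "positional encodings matter"
max_positions = 16  # hypothetical length limit

word_slots = len(word_tokens(text))
char_slots = len(char_tokens(text))

print("slots used (words):", word_slots, "fits:", fits(word_tokens(text), max_positions))
print("slots used (chars):", char_slots, "fits:", fits(char_tokens(text), max_positions))
print("fertility (chars):", fertility(text, char_tokens))
```

The same sentence fits comfortably under one split and overflows the limit under the other, even though its meaning has not changed.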

Key idea

Attention ignores order, so models add positional information through absolute, relative, or rotary schemes, with relative and rotary generalizing better to longer inputs.
