
Machine Learning

Positional Information

Why token vectors alone lack order and how position gets added back.


Attention is order-blind

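The self-attention mechanism treats its inputs as a set: if the input tokens are shuffled, each token still receives the same output vector, just in the shuffled slot. Without extra information, the model cannot tell the first token from the last, so order must be injected explicitly.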
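The order blindness can be checked directly. The sketch below is a minimal illustration, assuming a single attention head with identity query, key, and value projections and no causal mask (simplifications made here, not something the lesson specifies). The toy vectors and the permutation are arbitrary.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]                      # arbitrary reordering of the sequence
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Slot j of the shuffled run holds original token perm[j]; its output is unchanged.
same = all(
    math.isclose(out_shuffled[j][i], out[perm[j]][i], abs_tol=1e-12)
    for j in range(len(perm))
    for i in range(len(tokens[0]))
)
print(same)
```

Since nothing in the computation refers to a token's index, the outputs carry no trace of the original order.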

Three families of approaches

  • Absolute encodings add a position-dependent vector to each token embedding, either fixed sinusoids or learned per-position vectors.
  • Relative encodings bias attention scores by the distance between two tokens rather than by their absolute slots.
  • Rotary encodings rotate the query and key vectors by an angle proportional to position, so the attention dot product depends on the relative offset between the two tokens.
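The three families can be sketched in a few lines each. This is an illustrative sketch, not any particular model's implementation: the function names, the `base` value of 10000, the linear distance penalty, and its `slope` are all assumptions chosen for the example. The final check shows the rotary property: the same offset at different absolute positions gives the same query-key score.

```python
import math

def sinusoidal(pos, dim, base=10000.0):
    """Absolute: fixed sinusoidal vector to be added to the embedding at `pos` (dim even)."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / base ** (i / dim)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def relative_bias(i, j, slope=0.5):
    """Relative: a bias added to the attention score, depending only on |i - j|.
    The linear penalty and `slope` are illustrative choices."""
    return -slope * abs(i - j)

def rotate(vec, pos, base=10000.0):
    """Rotary: rotate each (even, odd) pair of `vec` by an angle proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos / base ** (i / dim)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # toy query vector
k = [1.1, 0.4, -0.7, 0.9]   # toy key vector

# Query and key are 3 positions apart in both cases, at very different absolute slots.
score_near = dot(rotate(q, 5), rotate(k, 2))
score_far = dot(rotate(q, 105), rotate(k, 102))
print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

The absolute scheme changes the token vectors before attention, the relative scheme changes the scores, and the rotary scheme changes the queries and keys so that the offset appears inside the score itself.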

Why modern models prefer relative and rotary

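Absolute learned positions struggle to generalize beyond the lengths seen in training, because a learned table has no trained vector for a position it never saw. Relative and rotary schemes depend on offsets between tokens, and the offsets seen in short sequences recur in long ones, so these schemes tend to extrapolate more gracefully and underpin many context-extension tricks.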
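A toy lookup makes the difference concrete. The sketch below assumes a hypothetical training length of 4 and placeholder table values. Clipping offsets to a maximum distance is one common way to build a relative table, used here purely for illustration.

```python
MAX_TRAIN_LEN = 4   # hypothetical longest sequence seen in training
MAX_DISTANCE = 3    # hypothetical clipping distance for relative offsets

# Learned absolute table: one row per position seen in training (placeholder values).
learned_table = {pos: [0.1 * pos, -0.1 * pos] for pos in range(MAX_TRAIN_LEN)}

def absolute_lookup(pos):
    """Return the trained vector for `pos`, or None if that slot was never trained."""
    return learned_table.get(pos)

# Relative table: one bias per clipped offset (placeholder values).
relative_table = {d: -0.2 * abs(d) for d in range(-MAX_DISTANCE, MAX_DISTANCE + 1)}

def relative_lookup(i, j):
    """Return the bias for query position i and key position j, clipping far offsets."""
    offset = max(-MAX_DISTANCE, min(MAX_DISTANCE, j - i))
    return relative_table[offset]

print(absolute_lookup(2) is not None)   # position seen in training
print(absolute_lookup(10) is None)      # beyond the training length: no row exists
print(relative_lookup(100, 101) == relative_lookup(0, 1))  # same offset, same bias
```

The relative table is indexed by offset, so it has an entry for every pair of positions regardless of how long the sequence gets.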

Connection to tokens

Position is assigned per token, so tokenization decides how many position slots a given text consumes. A high-fertility split, one that produces many tokens per word, uses more positions for the same meaning and hits length limits sooner.
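As a rough illustration, the sketch below compares two toy tokenizers on the same sentence. The word-level and character-level splits and the 16-slot context limit are assumptions for the example. Real tokenizers usually fall somewhere between these extremes.

```python
CONTEXT_LIMIT = 16  # hypothetical number of position slots available

def word_tokens(text):
    """Low-fertility toy tokenizer: one token per whitespace-separated word."""
    return text.split()

def char_tokens(text):
    """High-fertility toy tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]

def fits(tokens, limit=CONTEXT_LIMIT):
    """Each token consumes one position slot."""
    return len(tokens) <= limit

text = "position is assigned per token"
words = word_tokens(text)
chars = char_tokens(text)

fertility = len(chars) / len(words)  # tokens per word under the character split

print(len(words), fits(words))
print(len(chars), fits(chars))
```

The same text fits within the limit under the word split and exceeds it under the character split, even though the meaning is unchanged.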

Key idea

Attention ignores order, so models add positional information through absolute, relative, or rotary schemes, with relative and rotary schemes typically generalizing better to longer inputs.

Check yourself


1. Why must positional information be added to a transformer?

2. Why are relative and rotary encodings often preferred over absolute learned ones?

3. How does tokenization interact with positions?