Position as a penalty
Attention with Linear Biases, usually abbreviated ALiBi, adds no position embeddings to the token vectors at all. Instead it adds a bias to each attention score that grows more negative the farther apart two tokens are.
How the bias works
Before softmax, the score between a query and a key is reduced by an amount proportional to their distance. Each head gets its own slope, so:
- Heads with a steep slope focus sharply on nearby tokens.
- Heads with a gentle slope keep reaching far away.
This gives the model a built-in preference for recent tokens while still allowing long-range attention in some heads.
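The per-head penalty described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the slope schedule is the geometric one used in the original ALiBi paper (assumed here for a power-of-two head count), and the content scores are set to zero so that only the bias shapes the attention weights.

```python
import math

def alibi_slopes(num_heads):
    """Geometric slope schedule (assumption: num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(slope, seq_len):
    """Causal bias: -slope * (i - j) for keys j <= query i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

slopes = alibi_slopes(8)   # steepest slope first, gentlest last
seq_len = 6
raw_scores = [0.0] * seq_len  # hypothetical: identical content scores for every key

# Attention weights of the last query under the steepest and gentlest heads.
steep = softmax([s + b for s, b in zip(raw_scores, alibi_bias(slopes[0], seq_len)[-1])])
gentle = softmax([s + b for s, b in zip(raw_scores, alibi_bias(slopes[-1], seq_len)[-1])])
```

With identical content scores, the steep head puts most of its weight on the nearest keys, while the gentle head spreads its weight almost evenly across the whole sequence.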
Why it extrapolates
Because the bias depends only on the distance between two tokens, not on their absolute positions, it is well defined for distances longer than any seen in training. A model trained on short sequences can run on longer ones with graceful degradation, which made ALiBi popular for context extension.
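A small sketch can make the extrapolation argument concrete. The helper name and the slope value are hypothetical; the point is only that the penalty for a given distance is the same at any sequence length, and longer distances continue the same straight line.

```python
def alibi_bias_row(slope, query_pos):
    """Bias for one query over keys 0..query_pos (causal), as a function of distance only."""
    return [-slope * (query_pos - j) for j in range(query_pos + 1)]

slope = 0.25                                 # hypothetical slope for a single head
short_row = alibi_bias_row(slope, 3)         # last query of a length-4 sequence
long_row = alibi_bias_row(slope, 9)          # last query of a length-10 sequence

# The four nearest keys receive exactly the same penalties at both lengths.
nearest_match = long_row[-4:] == short_row

# Distances beyond the short length simply extend the same linear rule.
farthest_penalty = long_row[0]               # distance 9 -> -0.25 * 9
```

Nothing in the rule refers to the training length, so no table or embedding has to be resized or interpolated when the sequence grows.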
Contrast with rotary
ALiBi adds a scalar penalty to the attention scores, while rotary position embeddings (RoPE) rotate the query and key vectors before their dot product is taken. Both encode relative position, but ALiBi is simpler and extrapolates by design.
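The contrast can be illustrated with a toy two-dimensional example. This is a simplified sketch under stated assumptions: the function names are hypothetical, and the rotary variant uses a single rotation frequency `theta`, whereas real rotary embeddings apply a range of frequencies across pairs of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two 2-D vectors."""
    return a[0] * b[0] + a[1] * b[1]

def rotate(vec, angle):
    """Rotate a 2-D vector counterclockwise by the given angle in radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def alibi_score(q, k, q_pos, k_pos, slope):
    """ALiBi style: vectors untouched, a scalar distance penalty added to the score."""
    return dot(q, k) - slope * (q_pos - k_pos)

def rotary_score(q, k, q_pos, k_pos, theta):
    """Rotary style: each vector rotated by its own position, no added bias."""
    return dot(rotate(q, q_pos * theta), rotate(k, k_pos * theta))

q, k = (1.0, 2.0), (0.5, -1.0)   # hypothetical query and key vectors

# Same offset (3) at two different absolute positions.
alibi_a = alibi_score(q, k, 5, 2, slope=0.25)
alibi_b = alibi_score(q, k, 10, 7, slope=0.25)
rotary_a = rotary_score(q, k, 5, 2, theta=0.1)
rotary_b = rotary_score(q, k, 10, 7, theta=0.1)
```

In both schemes the score depends only on the offset between the two positions, so each pair of results agrees (up to floating-point error for the rotary case). The difference is where position enters: ALiBi leaves the vectors alone and shifts the score, while the rotary approach changes the vectors themselves.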
Key idea
ALiBi adds a distance-proportional penalty to attention scores, with a separate slope for each head. This builds in a recency bias while letting some heads reach far, and because the penalty is a function of distance alone, it extrapolates gracefully to longer sequences.