The Alibi Position Bias

Position as a penalty

Attention with linear biases, often called alibi, adds no position vectors at all. Instead it adds a bias to each attention score that grows more negative the farther apart two tokens are.

How the bias works

Before softmax, the score between a query and a key is reduced by an amount proportional to their distance. Each head gets its own slope, so:

Heads with a steep slope focus sharply on nearby tokens.
Heads with a gentle slope keep reaching far away.

This gives a built in preference for recency while still allowing long range attention in some heads.

Why it extrapolates

Because the bias is just a function of distance, it keeps working for distances longer than those seen in training. A model trained on short sequences can run on longer ones with graceful degradation, which made alibi popular for context extension.

Contrast with rotary

Alibi adds a scalar penalty to scores, while rotary rotates the vectors. Both encode relative position, but alibi is simpler and extrapolates naturally by design.

Key idea