The ALiBi Position Bias

Biasing attention scores by distance to extrapolate to longer sequences.

Position as a penalty

Attention with Linear Biases, usually abbreviated ALiBi, adds no positional embeddings to the token embeddings. Instead, it adds a bias to each attention score that grows more negative the farther apart the query and key tokens are.

How the bias works

Before the softmax, the score between a query and a key is reduced by an amount proportional to the distance between them. Each head has its own fixed slope that sets how fast the penalty grows, so:

  • Heads with a steep slope focus sharply on nearby tokens.
  • Heads with a gentle slope keep reaching far away.

This gives the model a built-in preference for recency while still allowing long-range attention in some heads.
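
The per-head penalty described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are invented for this example, causal attention is assumed (a query only sees itself and earlier keys), the content scores are set to zero so that only the bias shapes the weights, and the slopes follow the geometric sequence used in the original ALiBi paper (1/2, 1/4, ..., 1/256 for 8 heads).

```python
import math

def alibi_slopes(num_heads):
    """Geometric per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads.
    Assumes num_heads is a power of two, as in the original paper's recipe."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(slope, seq_len):
    """bias[i][j] = -slope * (i - j) for keys at or before the query.
    Future keys (j > i) get -inf, i.e. a causal mask (an assumption here)."""
    return [[-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

slopes = alibi_slopes(8)
seq_len = 6
raw_scores = [0.0] * seq_len  # equal content scores, so only the bias matters

for name, slope in (("steep", slopes[0]), ("gentle", slopes[-1])):
    bias_row = alibi_bias(slope, seq_len)[-1]  # bias seen by the last query
    weights = softmax([s + b for s, b in zip(raw_scores, bias_row)])
    print(name, [round(w, 3) for w in weights])
```

With the steep slope, most of the weight lands on the last few positions. With the gentle slope, the weights are close to uniform, so that head can still attend to distant tokens.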

Why it extrapolates

Because the bias is purely a function of distance, it is still defined for distances longer than any seen in training. A model trained on short sequences can run on longer ones with graceful degradation rather than abrupt failure, which made ALiBi popular for context extension.
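
One way to build intuition for this is to look at how much attention stays on nearby tokens as the sequence grows. The sketch below is a toy illustration only: it assumes all content scores are equal (real models will not satisfy this), and the helper name, slope, and window size are arbitrary choices made for the example.

```python
import math

def attention_weights(slope, seq_len):
    """Softmax weights of the last query over itself and all earlier keys.
    Index d is the distance from the query (0 = the query's own position).
    Content scores are assumed to be zero, so only the linear bias matters."""
    exps = [math.exp(-slope * d) for d in range(seq_len)]
    total = sum(exps)
    return [e / total for e in exps]

slope = 0.25   # arbitrary example slope
window = 16    # count the attention mass on the 16 nearest positions

for seq_len in (64, 512, 4096):
    weights = attention_weights(slope, seq_len)
    near_mass = sum(weights[:window])
    print(seq_len, round(near_mass, 4))
```

Under these assumptions the share of attention on the nearest positions barely changes as the sequence gets longer, because the penalty makes the contribution of far-away keys shrink geometrically. Nothing in the bias needs to be resized or relearned for the longer input.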

Contrast with rotary

ALiBi adds a scalar penalty to the attention scores, while rotary embeddings rotate the query and key vectors by a position-dependent angle. Both encode relative position, but ALiBi is simpler and extrapolates by design.
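
The difference between the two mechanisms can be sketched on a single 2-D query and key. This is a simplified illustration: the function names, vectors, slope, and rotation frequency `theta` are made up for the example, and real rotary embeddings apply many such 2-D rotations at different frequencies across the head dimension.

```python
import math

def alibi_score(q, k, q_pos, k_pos, slope):
    """Plain dot product minus a scalar penalty proportional to distance."""
    dot = sum(a * b for a, b in zip(q, k))
    return dot - slope * abs(q_pos - k_pos)

def rotate(vec, pos, theta):
    """Rotate a 2-D vector by the angle pos * theta."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def rotary_score(q, k, q_pos, k_pos, theta):
    """Dot product of the query and key after each is rotated by its position."""
    rq, rk = rotate(q, q_pos, theta), rotate(k, k_pos, theta)
    return sum(a * b for a, b in zip(rq, rk))

q, k = (1.0, 0.5), (0.8, -0.2)

# Same offset (3) at two different absolute positions.
print(alibi_score(q, k, 10, 7, 0.5), alibi_score(q, k, 110, 107, 0.5))
print(rotary_score(q, k, 10, 7, 0.1), rotary_score(q, k, 110, 107, 0.1))
```

In both cases the score depends only on the offset between the two positions, not on where they sit in the sequence. ALiBi gets there by subtracting a number from the score, rotary by changing the vectors before the dot product.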

Key idea

ALiBi adds a distance-proportional penalty to attention scores, with a separate slope for each head. This builds in a recency bias while letting some heads reach far. Because the penalty is a function of distance alone, it extrapolates gracefully to longer sequences.

Check yourself

1. How does ALiBi encode position?

2. Why is ALiBi good at length extrapolation?