The Sparse Attention Patterns

Throwing away most links

Full attention computes every pairwise link, most of which carry little signal. Sparse attention keeps only a chosen subset of links, cutting cost while trying to preserve the connections that matter.
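As a minimal sketch of what "keeping only a subset of links" means in practice, the function below runs scaled dot-product attention but scores only the pairs a boolean mask allows. The name masked_attention and the list-of-lists layout are illustrative assumptions, not taken from any particular library, and the code favors clarity over speed.

```python
import math

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the links allowed by mask.

    q, k, v: lists of n vectors (lists of floats).
    mask[i][j] is True when token i may attend to token j.
    Assumption: every row of mask allows at least one link.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Score only the kept links; dropped links are never computed.
        kept = [j for j, allowed in enumerate(mask[i]) if allowed]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in kept]
        # Numerically stable softmax over the kept links only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors over the kept links.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, kept)) / total
            for c in range(len(v[0]))
        ])
    return out
```

With a mask of all True this reduces to ordinary dense attention; the work done per token is proportional to the number of True entries in its row, which is where the savings come from.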

Common building blocks

Local links connect each token to its neighbors, capturing nearby structure.
Strided or dilated links connect to tokens at fixed gaps, reaching farther with few edges.
Global tokens are special positions that attend to every token and are attended to by every token, acting as hubs for long-range mixing.
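The three building blocks above can each be written as a boolean mask over token pairs. The sketch below is one possible encoding; the function names, the window and stride parameters, and the choice of symmetric (non-causal) links are assumptions made for illustration.

```python
def local_mask(n, window):
    """Each token links to neighbors at most `window` positions away."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def strided_mask(n, stride):
    """Each token links to tokens whose distance is a multiple of `stride`."""
    return [[(i - j) % stride == 0 for j in range(n)] for i in range(n)]

def global_mask(n, hubs):
    """Hub tokens link to every token, and every token links to the hubs."""
    hubs = set(hubs)
    return [[i in hubs or j in hubs for j in range(n)] for i in range(n)]

def union(*masks):
    """Combine patterns: a link is kept if any of the masks keeps it."""
    n = len(masks[0])
    return [[any(m[i][j] for m in masks) for j in range(n)] for i in range(n)]

# Print a small combined pattern: '#' marks a kept link, '.' a dropped one.
mask = union(local_mask(8, 1), strided_mask(8, 4), global_mask(8, [0]))
for row in mask:
    print("".join("#" if kept else "." for kept in row))
```

Printing the mask for a small n is a quick way to see the diagonal band from local links, the parallel off-diagonals from strided links, and the full row and column contributed by each hub.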

Why these patterns work

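Combining local links with a few global hubs lets information travel between any two tokens in a couple of hops, much as in small-world networks. The model approximates dense attention with a fraction of the connections: full attention over n tokens computes n^2 links, while a window of w neighbors plus g global tokens needs on the order of n(w + g). With w and g held fixed, cost grows roughly linearly in sequence length.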
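To make the hop-count argument concrete, the sketch below treats the sparse pattern as a graph and uses breadth-first search to count how many hops (one per attention layer) separate two tokens, with and without a hub. The helper names and the specific sizes are assumptions chosen for illustration.

```python
from collections import deque

def neighbors(i, n, window, hubs):
    """Tokens linked to i under a local window plus global hub tokens."""
    if i in hubs:
        return set(range(n))  # a hub links to every token
    near = set(range(max(0, i - window), min(n, i + window + 1)))
    return near | set(hubs)

def hops(src, dst, n, window, hubs):
    """Fewest hops for information to travel from src to dst (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        if i == dst:
            return dist[i]
        for j in neighbors(i, n, window, hubs):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return None  # dst is unreachable

def link_count(n, window, hubs):
    """Total number of kept links, counting each token's own row."""
    return sum(len(neighbors(i, n, window, hubs)) for i in range(n))

n = 4096
print(hops(1000, 3000, n, window=2, hubs=set()))  # local links only
print(hops(1000, 3000, n, window=2, hubs={0}))    # one hub at position 0
print(link_count(n, 2, {0}), "links kept vs", n * n, "dense")
```

With local links alone, crossing 2000 positions at 2 positions per hop takes 1000 hops; adding a single hub cuts the path to 2 hops (source to hub, hub to destination), while the number of kept links stays far below the dense n^2.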

The tradeoff

The pattern is fixed in advance, so it may miss a link the data actually needs. Designs combine several patterns to cover both local detail and global context, which is why long document transformers mix windows, dilation, and global tokens.
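The risk of a fixed pattern can be checked directly: given a list of token pairs the data would need to connect, test which ones the pattern drops. The sketch below is hypothetical throughout; the pairs in needed, the parameter values, and the function names are invented for illustration.

```python
def is_linked(i, j, window, stride, hubs):
    """True if a fixed window + stride + hub pattern keeps the direct link i -> j."""
    return (
        abs(i - j) <= window        # local window
        or (i - j) % stride == 0    # strided link
        or i in hubs or j in hubs   # global hub on either end
    )

def missed_links(needed, window, stride, hubs):
    """Pairs the data needs that the fixed pattern does not connect directly."""
    return [(i, j) for i, j in needed if not is_linked(i, j, window, stride, hubs)]

# Hypothetical pairs the data would want connected.
needed = [(5, 6), (5, 21), (5, 18), (0, 30)]
print(missed_links(needed, window=2, stride=8, hubs={0}))
```

Here (5, 6) is covered by the window, (5, 21) by the stride, and (0, 30) by the hub, but (5, 18) falls through every pattern and can only be reached indirectly over several hops.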

Key idea

Sparse attention keeps only chosen links, such as local neighbors, strided gaps, and a few global hub tokens. This approximates dense attention at near-linear cost, at the risk that a fixed pattern omits a connection the data needs.
