Throwing away most links
Full attention computes every pairwise link between tokens, so its cost grows quadratically with sequence length, yet most of those links carry little signal. Sparse attention keeps only a chosen subset of links, cutting cost while trying to preserve the connections that matter.
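One way to picture "keeping a subset of links" is as a boolean mask applied to the attention scores before the softmax. The sketch below is a minimal NumPy illustration, not a reference implementation: the function name `masked_attention` and the toy shapes are assumptions made for this example, and it assumes every row of the mask keeps at least one link (for instance the token itself). It still builds the full score matrix, so it shows the semantics of sparse attention rather than its cost savings; an efficient version would compute only the kept entries.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Attention restricted to kept links (hypothetical helper).

    q, k, v: arrays of shape (n, d).
    mask: boolean array of shape (n, n); True means the link i -> j is kept.
    Assumes each row of mask has at least one True entry.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Dropped links get a score of -inf, hence exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: 6 tokens, each linked only to itself and its direct neighbors.
rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= 1
out, weights = masked_attention(q, k, v, mask)
print(out.shape)                      # shape of the attended values
print(int((weights > 0).sum()), "nonzero weights out of", n * n)
```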
Common building blocks
- Local links connect each token to its neighbors, capturing nearby structure.
- Strided or dilated links connect each token to tokens at fixed gaps, reaching farther with few edges.
- Global tokens are special positions that attend to, and are attended to by, every other token, acting as hubs for long-range mixing.
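The three building blocks above can each be written as a boolean mask over token pairs and then combined with a logical OR. This is a hedged sketch: the helper names (`local_mask`, `strided_mask`, `global_mask`) and the parameter values are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def local_mask(n, window):
    """True where tokens i and j are at most `window` positions apart."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def strided_mask(n, stride):
    """True where the gap between i and j is a multiple of `stride`."""
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % stride == 0

def global_mask(n, global_positions):
    """True for every link into or out of the chosen hub positions."""
    mask = np.zeros((n, n), dtype=bool)
    mask[global_positions, :] = True   # hubs attend to every token
    mask[:, global_positions] = True   # every token attends to the hubs
    return mask

n = 16
mask = local_mask(n, 1) | strided_mask(n, 4) | global_mask(n, [0])
print(int(mask.sum()), "of", n * n, "links kept")
```

Expressing each pattern as its own mask keeps them easy to mix: adding or removing a pattern is a single OR term.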
Why these patterns work
Local links plus a few global hubs let information travel a long way in a couple of hops, much as in small-world networks. Because each token keeps only a small, roughly fixed number of links, the model approximates dense attention with a fraction of the connections, so cost can fall from quadratic toward linear in sequence length.
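The hop argument can be checked on a toy graph. The sketch below treats the mask as an adjacency matrix and uses breadth-first search to measure how many hops (layers of attention) are needed for the last token to reach every other token, with and without a single global hub. The names `sparse_mask` and `max_hops`, the sequence length 64, and the window of 2 are assumptions chosen for illustration.

```python
import numpy as np
from collections import deque

def sparse_mask(n, window, n_global):
    """Local window plus the first `n_global` positions as global hubs."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

def max_hops(mask, source):
    """Largest number of hops needed to reach any token from `source` (BFS)."""
    n = mask.shape[0]
    dist = np.full(n, -1)
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in np.flatnonzero(mask[i]):
            if dist[j] < 0:
                dist[j] = dist[i] + 1
                queue.append(j)
    return int(dist.max())

n = 64
local_only = sparse_mask(n, window=2, n_global=0)
with_hub = sparse_mask(n, window=2, n_global=1)

print(max_hops(local_only, n - 1))   # 32 hops with windows alone
print(max_hops(with_hub, n - 1))     # 2 hops once one hub is added
print(int(with_hub.sum()), "links kept versus", n * n, "for dense attention")
```

With a window of w and g hubs, each token keeps roughly 2w + 1 + 2g links, so the total grows in proportion to n rather than n squared as long as w and g stay fixed.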
The tradeoff
The pattern is fixed in advance, so it may miss a link the data actually needs. To reduce that risk, designs combine several patterns to cover both local detail and global context, which is why long-document transformers mix windows, dilation, and global tokens.
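A small coverage check makes the tradeoff concrete. In this sketch, `needed` is a made-up list of token pairs that the data is imagined to require; the window size, dilation stride, and hub position are likewise assumptions. Combining patterns covers more of the needed links than any single pattern, yet one link is still missing from the fixed pattern and can only be reached indirectly through the hub.

```python
import numpy as np

n = 32
idx = np.arange(n)
gap = idx[:, None] - idx[None, :]

window = np.abs(gap) <= 2            # local neighbors within 2 positions
dilated = gap % 5 == 0               # tokens at gaps that are multiples of 5
hubs = np.zeros((n, n), dtype=bool)  # token 0 acts as a global hub
hubs[0, :] = True
hubs[:, 0] = True
combined = window | dilated | hubs

# Hypothetical links (query, key) that the data is assumed to need.
needed = [(30, 29), (30, 10), (30, 7)]

def coverage(mask, links):
    """Count how many of the given links are present in the mask."""
    return sum(bool(mask[i, j]) for i, j in links)

for name, m in [("window", window), ("dilated", dilated),
                ("hubs", hubs), ("combined", combined)]:
    print(name, coverage(m, needed), "of", len(needed))

# The pair (30, 7) has no direct link, but a two-hop path through the hub exists.
print(bool(combined[30, 7]), bool(combined[30, 0] and combined[0, 7]))
```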
Key idea
Sparse attention keeps only chosen links, such as local neighbors, strided gaps, and a few global hub tokens. This approximates dense attention at near-linear cost, at the risk that a fixed pattern omits a connection the data needs.