Machine Learning

Sparse Attention Patterns

Hand-designed connectivity that keeps a few useful links instead of all of them.

Throwing away most links

Full attention scores every pair of tokens, so its cost grows quadratically with sequence length, yet most of those links carry little signal. Sparse attention keeps only a chosen subset of links, cutting cost while trying to preserve the connections that matter.

Common building blocks

  • Local links connect each token to its neighbors, capturing nearby structure.
  • Strided or dilated links connect to tokens at fixed gaps, reaching farther with few edges.
  • Global tokens are special positions that attend to, and are attended to by, every other token, acting as hubs for long-range mixing.
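
The three building blocks above can be sketched as boolean masks, where entry (i, j) is true when token i keeps a link to token j. This is a minimal illustration in plain Python, not any particular library's implementation. The function names, the window size, the stride, and the choice of position 0 as the hub are all assumptions made for the example.

```python
def local_mask(n, window):
    """Keep links between tokens at most `window` positions apart."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def strided_mask(n, stride):
    """Keep links whose gap is a multiple of `stride` (fixed-gap links)."""
    return [[(i - j) % stride == 0 for j in range(n)] for i in range(n)]

def global_mask(n, hubs):
    """Keep every link that starts or ends at a hub position."""
    hubs = set(hubs)
    return [[i in hubs or j in hubs for j in range(n)] for i in range(n)]

def union(*masks):
    """Combine patterns: a link is kept if any pattern keeps it."""
    n = len(masks[0])
    return [[any(m[i][j] for m in masks) for j in range(n)] for i in range(n)]

def show(mask):
    """Print the mask as a grid: 'x' for a kept link, '.' for a dropped one."""
    for row in mask:
        print("".join("x" if kept else "." for kept in row))

# Toy settings (assumed): 8 tokens, window of 1, stride of 4, hub at position 0.
n = 8
combined = union(local_mask(n, 1), strided_mask(n, 4), global_mask(n, [0]))
show(combined)
```

Printing the grid shows a diagonal band from the local links, thinner off-diagonal stripes from the strided links, and a full first row and first column from the hub.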

Why these patterns work

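Local links plus a few global hubs let information travel a long way in a couple of hops, much as shortcuts do in small-world networks. The model approximates dense attention with a fraction of the connections: with a fixed window size and a fixed number of hubs, the number of links per token stays roughly constant, so total cost can fall toward linear in sequence length.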
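
To make both claims concrete, the sketch below counts kept links as the sequence doubles and uses breadth-first search to measure how many hops separate two distant tokens. It is only an illustration under assumed settings (window of 2, a single hub at position 0). The helpers `sparse_links`, `total_links`, and `hops` are hypothetical names introduced here.

```python
from collections import deque

def sparse_links(n, window, hubs=()):
    """For each token, the set of positions it keeps a link to."""
    hubs = set(hubs)
    neighbors = []
    for i in range(n):
        if i in hubs:
            linked = set(range(n))  # a hub is linked to every position
        else:
            lo, hi = max(0, i - window), min(n - 1, i + window)
            linked = set(range(lo, hi + 1)) | hubs  # window plus all hubs
        neighbors.append(linked)
    return neighbors

def total_links(neighbors):
    """Total number of kept links across all tokens."""
    return sum(len(linked) for linked in neighbors)

def hops(neighbors, start, goal):
    """Fewest links to traverse from start to goal (breadth-first search)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None  # goal is unreachable under this pattern

# Doubling the length roughly doubles sparse links but quadruples dense ones.
for n in (1024, 2048):
    sparse = total_links(sparse_links(n, window=2, hubs=[0]))
    print(n, "dense:", n * n, "sparse:", sparse)

local_only = sparse_links(1024, window=2)
with_hub = sparse_links(1024, window=2, hubs=[0])
print("hops without hub:", hops(local_only, 1000, 10))
print("hops with hub:", hops(with_hub, 1000, 10))
```

With these toy settings, the single hub shortens the path between positions 1000 and 10 from 495 hops to 2, since each hop corresponds to one attention layer passing information along a kept link.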

The tradeoff

The pattern is fixed in advance, so it may miss a link the data actually needs. Designs therefore combine several patterns to cover both local detail and global context, which is why long-document transformers mix windows, dilation, and global tokens.
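
The risk can be seen in a toy example. In the sketch below, the scores are made up so that token 5 strongly prefers token 1. A local window alone drops that link, while adding a strided pattern happens to recover it. The `masked_softmax` helper and all the numbers are assumptions for illustration, and the helper assumes every row keeps at least one link.

```python
import math

def masked_softmax(scores, mask):
    """Row-wise softmax over kept links only; dropped links get weight 0.

    Assumes every row of `mask` keeps at least one link.
    """
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        kept = [s for s, keep in zip(row_scores, row_mask) if keep]
        peak = max(kept)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if keep else 0.0
                for s, keep in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 6
# Made-up scores: all zero, except token 5 strongly prefers distant token 1.
scores = [[0.0] * n for _ in range(n)]
scores[5][1] = 4.0

# Pattern A: local window of 1 only.
local = [[abs(i - j) <= 1 for j in range(n)] for i in range(n)]
# Pattern B: the same window plus strided links at gaps of 4.
mixed = [[abs(i - j) <= 1 or (i - j) % 4 == 0 for j in range(n)]
         for i in range(n)]

local_weights = masked_softmax(scores, local)
mixed_weights = masked_softmax(scores, mixed)

print("weight on token 1, local only:", local_weights[5][1])
print("weight on token 1, local + strided:", round(mixed_weights[5][1], 3))
```

Under the local pattern the weight on token 1 is exactly zero no matter how high its score is. The strided pattern recovers it here only because the gap of 4 happens to match, which is the sense in which a fixed pattern can still miss links the data needs.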

Key idea

Sparse attention keeps only chosen links, such as local neighbors, strided gaps, and a few global hub tokens. It approximates dense attention at near-linear cost, at the risk that a fixed pattern omits a connection the data needs.

Check yourself

1. What role do global tokens play in sparse attention?

2. What is a risk of fixed sparse patterns?