← Lessons

quiz vs the machine

Gold1380

Machine Learning

The Sliding Window Attention

Restricting each token to a local window to make attention linear in length.

5 min read · core · beat Gold to climb

Limiting reach

Full attention lets every token see every other token, which costs work that grows with the square of length. Sliding window attention restricts each token to attend only to a fixed number of nearby tokens, a window that slides along the sequence.

Linear cost

If the window has size w, each token does work proportional to w rather than to the full length. Total cost becomes linear in sequence length, so very long inputs become affordable.

Reaching far through depth

A natural worry is that a token cannot see far away. But windows stack. A token in layer one sees its window, and the tokens at its edge saw their own windows, so by layer L the effective receptive field grows roughly with depth times window size, like a stack of convolutions.

Hybrid designs

  • Some layers use a window, others use full attention.
  • Some models add a few global tokens that everyone can see.

These mixes keep long range reach while staying cheap.

Key idea

Sliding window attention lets each token attend only to a local window, making cost linear in length, while stacking layers grows the effective receptive field so distant tokens still connect through depth.

Check yourself

Answer to earn rating on the learn ladder.

1. What is the cost of sliding window attention in sequence length?

2. How can a token influence a far away token despite a small window?