Limiting reach
Full attention lets every token see every other token, which costs work that grows with the square of length. Sliding window attention restricts each token to attend only to a fixed number of nearby tokens, a window that slides along the sequence.
Linear cost
If the window has size w, each token does work proportional to w rather than to the full length. Total cost becomes linear in sequence length, so very long inputs become affordable.
Reaching far through depth
A natural worry is that a token cannot see far away. But windows stack. A token in layer one sees its window, and the tokens at its edge saw their own windows, so by layer L the effective receptive field grows roughly with depth times window size, like a stack of convolutions.
Hybrid designs
- Some layers use a window, others use full attention.
- Some models add a few global tokens that everyone can see.
These mixes keep long range reach while staying cheap.
Key idea
Sliding window attention lets each token attend only to a local window, making cost linear in length, while stacking layers grows the effective receptive field so distant tokens still connect through depth.