The long-context goal
Many applications want a model to read whole books or codebases in one pass. But self-attention compute grows with the square of sequence length, and the KV cache grows linearly with it, so naively long contexts are expensive. Several techniques make them practical.
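To make the scaling concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) is a hypothetical configuration chosen for illustration, not one taken from these notes, and the function names are made up for the example.

```python
def attention_score_entries(seq_len):
    """Entries in the full attention score matrix for one head."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 covers keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    for n in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(n, n_layers=32, n_kv_heads=32, head_dim=128) / 1024**3
        print(f"{n:>7} tokens: {attention_score_entries(n):.2e} score entries, "
              f"{gib:.0f} GiB KV cache")
```

Doubling the sequence length doubles the cache but quadruples the number of score entries, which is why the two costs call for different remedies.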
Efficient attention
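- Sliding-window and sparse attention patterns restrict each token to a subset of positions, cutting attention cost toward linear in sequence length.
- FlashAttention computes exact full attention but avoids materializing the full score matrix, sharply reducing memory traffic.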
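The sliding-window idea can be sketched as a causal mask in which each query sees only the most recent `window` keys, itself included. This is a minimal pure-Python illustration with hypothetical helper names. Real implementations build the mask on the accelerator or skip masked blocks entirely.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j."""
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    full = attended_pairs(sliding_window_mask(8, 8))   # plain causal attention
    local = attended_pairs(sliding_window_mask(8, 3))  # window of 3 tokens
    print(full, local)
```

With a fixed window, the number of attended pairs grows roughly as `seq_len * window` rather than `seq_len ** 2`.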
Position extension
Models trained on short sequences often degrade on longer ones. Several techniques adjust the position encoding to extend the usable range:
- Scaling rotary frequencies so that longer sequences map into the position range seen in training.
- ALiBi-style linear distance biases, which are designed to extrapolate to longer sequences.
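Both ideas can be sketched in a few lines. The rotary part below uses the simplest linear variant (often called position interpolation), where positions are scaled by `train_len / target_len`; that choice is an assumption, since the notes do not say which scaling scheme is meant. The ALiBi slopes follow the usual geometric sequence and assume the head count is a power of two. Function names are hypothetical.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 compresses positions."""
    return [
        position * scale * base ** (-2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def alibi_slopes(n_heads):
    """Per-head slopes; assumes n_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(slope, query_pos, key_pos):
    """Penalty added to the attention score, linear in distance."""
    return -slope * (query_pos - key_pos)


if __name__ == "__main__":
    train_len, target_len = 4096, 8192
    scale = train_len / target_len
    # Position 8000 in the stretched model lands on the angles of position 4000.
    same = all(
        math.isclose(a, b)
        for a, b in zip(rope_angles(8000, 8, scale=scale), rope_angles(4000, 8))
    )
    print(same, alibi_slopes(8)[0], alibi_bias(0.5, 10, 6))
```

The rotary scaling keeps every angle inside the range the model saw during training, while the ALiBi bias depends only on relative distance and so has no fixed maximum position.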
Cache and retrieval
- KV cache compression and eviction keep memory bounded while preserving attention sinks (the first few tokens) and the most recent tokens.
- Retrieval fetches only the most relevant passages so the model attends to a small slice instead of everything.
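A minimal sketch of both ideas, under stated assumptions: the eviction policy keeps a fixed number of sink tokens plus a recent window, and retrieval ranks passages by cosine similarity over embedding vectors that are assumed to be precomputed elsewhere. The function names and the similarity measure are illustrative choices, not something the notes prescribe.

```python
import math


def keep_indices(cache_len, n_sink, n_recent):
    """Cache positions that survive eviction: first n_sink plus last n_recent."""
    if cache_len <= n_sink + n_recent:
        return list(range(cache_len))  # nothing to evict yet
    return list(range(n_sink)) + list(range(cache_len - n_recent, cache_len))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_passages(query_vec, passage_vecs, k):
    """Indices of the k passages most similar to the query, best first."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(keep_indices(10, n_sink=2, n_recent=3))
    print(top_k_passages([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
```

Under this policy the cache never holds more than `n_sink + n_recent` entries, and the model only attends to the `k` retrieved passages rather than the whole corpus.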
Putting it together
Real long-context systems combine several of these techniques. Efficient attention controls compute, position extension keeps the model coherent, and cache management plus retrieval control memory. Together they push usable context far beyond the original training length.
Key idea
Long-context techniques combine efficient attention to control compute, position extension to keep the model coherent past its training length, and cache compression plus retrieval to bound memory. Together they stretch usable context far beyond the original window.