What the context window is
The context window is the maximum number of tokens a model can process at once, counting both the prompt and the text it has generated so far. Anything that does not fit must be dropped, summarized, or left out and retrieved when needed. The window therefore bounds how much the model can reason over in a single pass.
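As a rough illustration of the "dropped" option, the sketch below keeps only the most recent messages that fit a token budget. The names `count_tokens` and `fit_to_window` are hypothetical, and counting whitespace-separated words is a stand-in assumption for a real tokenizer, which would give different counts.

```python
def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real tokenizer would produce different counts.
    return len(text.split())


def fit_to_window(messages: list[str], max_tokens: int,
                  reserve_for_output: int) -> list[str]:
    """Keep the most recent messages whose total size fits the prompt budget.

    The window covers both prompt and generated text, so part of it is
    reserved for the output the model has yet to produce.
    """
    budget = max_tokens - reserve_for_output
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i"]
print(fit_to_window(history, max_tokens=8, reserve_for_output=2))
# → ['d e', 'f g h i']
```

Dropping the oldest messages first is only one possible policy; summarizing them or retrieving them on demand are the alternatives named above.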
Why long context is hard
Standard attention compares every token with every other token, so its compute cost grows with the square of the sequence length: doubling the length roughly quadruples the attention work. The KV cache, which stores keys and values for every token already processed, grows linearly with length and adds a memory cost on top. Together these make very long windows expensive.
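The arithmetic can be made concrete with a small sketch. The function names and the model shape used below (32 layers, 32 key/value heads, head dimension 128, 2 bytes per value) are illustrative assumptions, not the configuration of any particular model.

```python
def attention_pairs(seq_len: int) -> int:
    # Full attention scores every (query, key) pair: seq_len * seq_len.
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    # Each layer stores one key vector and one value vector per token,
    # hence the leading factor of 2. Memory is linear in seq_len.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Doubling the length quadruples the number of attention scores.
print(attention_pairs(8192) // attention_pairs(4096))  # → 4

# Hypothetical model shape, chosen only for illustration.
cache = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(cache / 2**30)  # → 2.0 (GiB for a single 4096-token sequence)
```

Under these assumptions the cache costs 0.5 MiB per token, so the same model at 128K tokens would need about 64 GiB for one sequence.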
How models extend it
- Efficient attention methods, such as sliding-window or sparse attention patterns, reduce the quadratic cost
- Position encodings that extrapolate, such as rotary embeddings with scaling, let a model handle lengths beyond what it trained on
- Retrieval fetches only the relevant passages so the model reads less
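To illustrate the first item, here is a minimal sketch of a causal sliding-window attention mask in plain Python. The function names are hypothetical, and a real implementation would build the mask as a tensor (or avoid materializing it at all); the point is only that each token attends to a fixed number of recent tokens, so the work grows linearly with length instead of quadratically.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Each position sees itself and the (window - 1) positions before it.
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def count_allowed(mask: list[list[bool]]) -> int:
    # Number of (query, key) pairs that are actually scored.
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(seq_len=6, window=3)
print(mask[5])              # → [False, False, False, True, True, True]
print(count_allowed(mask))  # → 15, versus 21 for full causal attention
```

Once the sequence is longer than the window, every additional token adds only `window` new pairs, which is where the saving comes from.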
Quality at long range
A long window does not guarantee good use of it. Models often show a "lost in the middle" effect: they use information at the start and end of a long input well but neglect the middle. Evaluating long context therefore means testing whether the model actually uses information placed deep inside the input, not just whether the input fits.
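One common way to run such a test is to plant a known fact (a "needle") at different depths in filler text and check whether the model can recall it. The sketch below assumes a hypothetical `answer_fn` callable that takes a prompt and returns the model's reply; `build_haystack` and `score_by_depth` are illustrative names, and substring matching is a deliberately crude scoring rule.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth: 0.0 is the start, 1.0 the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def score_by_depth(answer_fn: Callable[[str], str], filler: list[str],
                   needle: str, question: str, expected: str,
                   depths: list[float]) -> dict[float, bool]:
    """Record, for each depth, whether the reply contains the expected answer."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, depth) + "\n\n" + question
        results[depth] = expected in answer_fn(prompt)
    return results


print(build_haystack(["a", "b"], "N", 0.5))  # → a N b
```

A model that uses its whole window should score well at every depth; a dip at the middle depths is the pattern described above.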
Key idea
The context window limits how many tokens a model can attend to at once. Extending it means contending with quadratic attention cost, growing KV cache memory, and weak use of the middle of the input.