Context Window and Long Context

What the context window is

The context window is the maximum number of tokens a model can take in at once, including the prompt and the text it has generated so far. Anything beyond it must be dropped, summarized, or retrieved later. The window bounds how much the model can reason over in a single pass.
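As a minimal sketch of the "dropped" option, the snippet below trims a prompt so that the prompt plus a reserved generation budget fits in the window. The function name `fit_to_window` and its parameters are hypothetical, and keeping the most recent tokens is only one possible policy.

```python
def fit_to_window(prompt_tokens, window_size, reserved_for_output):
    """Keep the most recent prompt tokens that fit, leaving room for generation.

    Assumption: the window counts prompt tokens and generated tokens together,
    so the prompt may use at most window_size - reserved_for_output tokens.
    """
    if reserved_for_output >= window_size:
        raise ValueError("reserved_for_output must be smaller than window_size")
    budget = window_size - reserved_for_output
    dropped = max(0, len(prompt_tokens) - budget)
    # Drop the oldest tokens; summarizing or retrieving them later are alternatives.
    return prompt_tokens[dropped:], dropped


# Toy example: 10 prompt tokens, an 8-token window, 3 tokens reserved for output.
kept, dropped = fit_to_window(list(range(10)), window_size=8, reserved_for_output=3)
```

Here the prompt budget is 5 tokens, so the 5 oldest tokens are dropped and the 5 newest are kept.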

Why long context is hard

Standard attention compares every token with every other token, so its cost grows with the square of the sequence length. Doubling the length roughly quadruples the attention work, while the KV cache memory grows linearly with length. Together, these costs make very long windows expensive.
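The arithmetic can be sketched in a few lines. The model shape used here (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value) is an illustrative assumption, not a description of any particular model, and the function names are hypothetical.

```python
def attention_pairs(seq_len):
    """Number of token-to-token comparisons in full (non-causal) attention."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only to make the numbers concrete.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

pairs_ratio = attention_pairs(8192) / attention_pairs(4096)
cache_ratio = kv_cache_bytes(8192, **shape) / kv_cache_bytes(4096, **shape)
cache_mib_at_4096 = kv_cache_bytes(4096, **shape) / 2**20
```

Under these assumptions, doubling the length from 4096 to 8192 tokens multiplies the comparison count by 4 and the cache size by 2.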

How models extend it

Efficient attention methods, such as sliding-window or sparse patterns, reduce the quadratic cost by limiting which tokens attend to each other.
Position encodings that extrapolate, such as rotary embeddings with scaling, let a model handle lengths beyond those it was trained on.
Retrieval fetches only the relevant passages, so the model reads less.
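Two of the approaches above can be illustrated with a short sketch. The first function builds a causal sliding-window mask, and the second computes rotary angles with a linear position scale (one common form of scaling, often called position interpolation). All names, the base of 10000, and the example sizes are assumptions made for illustration.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal: a token sees only itself and earlier tokens.
    Windowed: it sees at most the last `window` tokens.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def allowed_pairs(mask):
    """Count the query/key pairs the mask lets through."""
    return sum(sum(row) for row in mask)


def rotary_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position, one angle per pair of dimensions.

    With scale > 1, positions are squeezed so that a longer input maps
    into the position range seen during training.
    """
    scaled = position / scale
    return [scaled * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# With a fixed window, allowed pairs grow roughly linearly with length.
pairs_short = allowed_pairs(sliding_window_mask(6, window=3))
pairs_long = allowed_pairs(sliding_window_mask(12, window=3))

# With scale=2, position 8192 gets the same angles as unscaled position 4096.
same_angles = rotary_angles(8192, 64, scale=2.0) == rotary_angles(4096, 64)
```

Doubling the length from 6 to 12 with a window of 3 takes the allowed pairs from 15 to 33, rather than quadrupling them as full attention would.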

Quality at long range

A long window does not guarantee good use of it. Models often show a "lost in the middle" effect, attending well to the start and end of a long input while neglecting the middle. Evaluating long context means testing whether the model actually uses information placed deep inside the input, not just whether the input fits.
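One way to probe this is to plant a known fact at different depths of a long input and check whether the model can recall it. The sketch below assumes a hypothetical `ask_model(context, question)` callable. The `toy_model` stand-in only "sees" the edges of its input, so it mimics the effect rather than measuring a real model.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    index = round(depth * len(filler_sentences))
    return " ".join(filler_sentences[:index] + [needle] + filler_sentences[index:])


def depth_sweep(ask_model, filler_sentences, needle, question, answer, depths):
    """Return {depth: True/False} for whether the reply contains the answer."""
    results = {}
    for depth in depths:
        context = build_haystack(filler_sentences, needle, depth)
        results[depth] = answer in ask_model(context, question)
    return results


def toy_model(context, question, edge_chars=200):
    """Stand-in for a real model: reads only the first and last edge_chars characters."""
    visible = context[:edge_chars] + context[-edge_chars:]
    return "7421" if "7421" in visible else "unknown"


filler = ["The sky was grey that morning."] * 50
results = depth_sweep(
    toy_model,
    filler,
    needle="The secret code is 7421.",
    question="What is the secret code?",
    answer="7421",
    depths=[0.0, 0.5, 1.0],
)
```

With this toy stand-in, the fact is recalled at depths 0.0 and 1.0 but missed at 0.5. Against a real model, the same sweep would be repeated over many depths and input lengths.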

Key idea

The context window limits how many tokens a model attends to at once. Extending it means contending with quadratic attention cost, growing memory use, and weak use of the middle of the input.
