The window is a token budget
A model's context length is the maximum number of tokens it can attend to at once. It covers the prompt and the generated reply together, so they share one budget.
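As a minimal sketch of the shared budget, the arithmetic can be written out directly. The function name and the numbers are illustrative assumptions, not the limits of any particular model.

```python
def max_output_tokens(context_length: int, prompt_tokens: int) -> int:
    """Tokens left for the reply once the prompt has taken its share."""
    if prompt_tokens > context_length:
        raise ValueError("prompt alone exceeds the context length")
    return context_length - prompt_tokens


# Illustrative numbers only: an 8,192-token window with a 6,000-token prompt.
remaining = max_output_tokens(context_length=8192, prompt_tokens=6000)
print(remaining)  # 2192
```

Every token added to the prompt is one token fewer available for the reply.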
What competes for space
- The system message and instructions.
- Conversation history in a chat.
- Retrieved documents in a retrieval pipeline.
- The model's own growing output.
When the total would exceed the limit, something must be dropped, summarized, or truncated.
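One common way to handle the overflow is to drop the oldest chat history first. The sketch below assumes that policy. `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, and a real system would call the model's own tokenizer instead.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Hypothetical stand-in: whitespace-separated words, not a real tokenizer.
    return len(text.split())


def fit_history(
    system: str,
    history: List[str],
    context_length: int,
    reserved_for_output: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Keep the most recent history messages that fit the remaining budget."""
    budget = context_length - reserved_for_output - counter(system)
    kept: List[str] = []
    # Walk from newest to oldest so the oldest messages are dropped first.
    for message in reversed(history):
        cost = counter(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["one two three", "four five", "six seven eight nine"]
print(fit_history("be brief", history, context_length=12, reserved_for_output=4))
# ['four five', 'six seven eight nine']
```

Reserving space for the output up front keeps the reply from being squeezed out by a long history. Summarizing the dropped messages instead of discarding them is an alternative that this sketch does not cover.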
Why tokens, not characters
Because the model operates on tokens, the limit is naturally a token count. Text with the same number of characters can consume very different numbers of tokens, depending on the language and on the tokenizer's fertility (roughly, how many tokens it produces per word).
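To make the gap between characters and tokens concrete, the sketch below assumes the simplest possible tokenizer: a byte-level one that emits one token per UTF-8 byte. Subword tokenizers behave differently in detail, but the effect is the same in kind: equal character counts, unequal token counts.

```python
def byte_level_token_count(text: str) -> int:
    # Assumption: a byte-level tokenizer with one token per UTF-8 byte.
    return len(text.encode("utf-8"))


for sample in ["hello", "こんにちは"]:
    print(sample, len(sample), byte_level_token_count(sample))
# hello 5 5
# こんにちは 5 15
```

Both strings are five characters long, yet under this assumption the Japanese greeting uses three times as much of the budget.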
Practical pressure
Longer contexts also cost more to process and add latency, because attention has more tokens to compare, so filling the window is rarely free even when everything fits. Good systems budget context deliberately rather than dumping everything in.
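A rough sense of the slowdown comes from counting the token pairs that attention scores. The sketch assumes standard full self-attention, where every token is compared with every other token. Architectures with sparse or windowed attention scale differently, and real latency depends on much more than this one count.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention in one layer and head."""
    return num_tokens * num_tokens


short_len, long_len = 4_000, 8_000  # illustrative sequence lengths
print(attention_pairs(long_len) / attention_pairs(short_len))  # 4.0
```

Under that assumption, doubling the context quadruples the pairwise work, which is one reason a window that fits can still be expensive to fill.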
Key idea
Context length is a shared token budget for prompt plus output, and everything from history to retrieved text competes for that limited space.