A finite budget
A model can attend to only a limited number of tokens at once; that limit is called the context window. Everything you send competes for it: the system prompt, instructions, examples, retrieved documents, the user message, and the space left for the answer.
What fills the window
- Instructions and persona from the system prompt.
- Examples if you use few-shot prompting.
- Retrieved context pulled in for grounding.
- Conversation history from earlier turns.
- Reserved output room for the model's reply.
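As a rough illustration of how these parts share one budget, the sketch below adds up the cost of each part and subtracts it, along with the reserved output room, from the window size. The names `count_tokens` and `remaining_budget`, the window size of 100, and the sample strings are all assumptions made for this example; the word-count tokenizer is only a crude stand-in for a model's real tokenizer, which would give different numbers.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def remaining_budget(window: int, reserved_output: int, parts: dict[str, str]) -> int:
    """Tokens still free after every prompt part and the reserved reply room."""
    used = sum(count_tokens(text) for text in parts.values())
    return window - reserved_output - used


# Hypothetical prompt parts, one per category in the list above.
parts = {
    "system": "You are a concise assistant.",
    "examples": "Q: 2+2? A: 4",
    "retrieved": "The context window is a fixed token limit.",
    "history": "User asked about limits earlier.",
    "user": "How big is the window?",
}

left = remaining_budget(window=100, reserved_output=40, parts=parts)
print(left)  # 33 with this word-count approximation
```

A negative result would mean the prompt already overflows the window and something has to be cut before sending.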
Budgeting strategies
- Trim history by summarizing old turns instead of resending them.
- Rank retrieval so only the most relevant passages are included.
- Compress examples to the fewest that still teach the pattern.
- Reserve output tokens so a long answer is not truncated.
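Two of these strategies can be sketched in a few lines, under some loudly stated assumptions: `select_passages` and `trim_history` are hypothetical helper names, relevance scores are assumed to come from some retrieval step not shown here, token counts use a word-count approximation rather than a real tokenizer, and `summarize` is a placeholder for whatever summarization call you actually use. The summary's own cost is not charged against the history budget in this sketch.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def select_passages(scored: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring passages that still fit in the budget."""
    chosen, used = [], 0
    for _score, passage in sorted(scored, key=lambda p: p[0], reverse=True):
        cost = count_tokens(passage)
        if used + cost <= budget:
            chosen.append(passage)
            used += cost
    return chosen


def trim_history(turns: list[str], budget: int, summarize) -> list[str]:
    """Keep the newest turns verbatim; fold everything older into one summary."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    older = turns[: len(turns) - len(kept)]
    return ([summarize(older)] if older else []) + kept


passages = select_passages(
    [(0.9, "a b c"), (0.2, "d e f g"), (0.7, "h i")], budget=5
)
history = trim_history(
    ["one two three", "four five", "six"],
    budget=3,
    summarize=lambda old: f"[summary of {len(old)} earlier turns]",
)
print(passages)  # ['a b c', 'h i']
print(history)   # ['[summary of 1 earlier turns]', 'four five', 'six']
```

The greedy selection is one simple choice among several; it favors relevance first and skips any passage that would push the total over budget.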
Why it matters
Overflowing the window forces truncation, which can silently drop important context and degrade answers. Models can also lose focus on content buried in the middle of a very long prompt, so placement and brevity both help.
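One possible way to act on the placement point is to order retrieved passages so the most relevant ones sit near the start and end of the prompt, with the least relevant in the middle. The sketch below is a heuristic chosen for illustration, not a method prescribed here; `edge_first_order` is a hypothetical name, and the input is assumed to be sorted best-first already.

```python
def edge_first_order(ranked: list[str]) -> list[str]:
    """Alternate best-first items between the front and the back of the list,
    so the weakest items end up in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


ordered = edge_first_order(["p1", "p2", "p3", "p4", "p5"])
print(ordered)  # ['p1', 'p3', 'p5', 'p4', 'p2']
```

Here `p1` (the best passage) opens the prompt, `p2` closes it, and `p5` (the weakest) lands in the middle, where a lapse in focus costs the least.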
Key idea
The context window is a finite token budget shared by every part of the prompt, so trimming history, ranking retrieval, and reserving output room keep the most useful content in view.