The claim versus the use
A large context window means a model can read many tokens, but reading is not the same as using. Long-context evaluation tests whether the model can retrieve and reason over information buried deep inside a long input.
Core probes
- Needle in a haystack: hide a fact at varying depths and ask the model to recover it.
- Multi-needle retrieval: require several scattered facts at once.
- Aggregation: demand reasoning that combines distant parts of the text.
Plotting accuracy against the position of the fact shows where recall weakens.
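A minimal sketch of how a needle-in-a-haystack probe set might be built, assuming the haystack is a list of filler sentences and depth is a fraction between 0.0 (start) and 1.0 (end). The names insert_needle and build_probes are hypothetical helpers for illustration, not part of any particular evaluation suite, and the model call itself is left out.

```python
from typing import Dict, List, Union


def insert_needle(haystack: List[str], needle: str, depth: float) -> List[str]:
    """Return a copy of haystack with needle inserted at a fractional depth.

    depth=0.0 places the needle first, depth=1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(haystack))
    return haystack[:index] + [needle] + haystack[index:]


def build_probes(
    haystack: List[str], needle: str, question: str, depths: List[float]
) -> List[Dict[str, Union[float, str]]]:
    """Build one prompt per depth, keeping the depth alongside it for later plotting."""
    probes = []
    for depth in depths:
        sentences = insert_needle(haystack, needle, depth)
        prompt = " ".join(sentences) + "\n\nQuestion: " + question
        probes.append({"depth": depth, "prompt": prompt})
    return probes


if __name__ == "__main__":
    # Toy filler and needle, invented purely for illustration.
    filler = [f"Filler sentence number {i}." for i in range(10)]
    needle = "The access code is 7291."
    for probe in build_probes(filler, needle, "What is the access code?", [0.0, 0.5, 1.0]):
        # Character offset of the needle grows as depth increases.
        print(probe["depth"], probe["prompt"].index(needle))
```

Keeping the depth next to each prompt means every model answer can later be scored and plotted against the position where the fact was hidden.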
The lost-in-the-middle effect
Models often recall information near the beginning and end of the context well, but accuracy degrades for information placed in the middle. A single average score hides this, so long-context evals report accuracy as a function of depth, exposing the dip.
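To make the contrast between a single average and a per-depth breakdown concrete, here is a small sketch. It assumes each trial is recorded as a (depth, correct) pair with depth between 0.0 and 1.0; the function name accuracy_by_depth and the equal-width binning are illustrative choices, and the sample results are made up to show the shape of the output, not measurements from any model.

```python
from typing import List, Optional, Tuple


def accuracy_by_depth(
    results: List[Tuple[float, bool]], num_bins: int = 5
) -> List[Optional[float]]:
    """Mean accuracy per equal-width depth bin; None for bins with no trials."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    hits = [0] * num_bins
    totals = [0] * num_bins
    for depth, correct in results:
        if not 0.0 <= depth <= 1.0:
            raise ValueError("depth must be between 0.0 and 1.0")
        # depth == 1.0 would index past the end, so clamp it into the last bin.
        bin_index = min(int(depth * num_bins), num_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += int(correct)
    return [hits[i] / totals[i] if totals[i] else None for i in range(num_bins)]


if __name__ == "__main__":
    # Made-up trials: correct at the edges, wrong in the middle.
    toy_results = [
        (0.0, True), (0.1, True),
        (0.5, False), (0.5, False),
        (0.9, True), (1.0, True),
    ]
    overall = sum(correct for _, correct in toy_results) / len(toy_results)
    print("overall accuracy:", round(overall, 3))
    print("accuracy by depth bin:", accuracy_by_depth(toy_results))
```

On data shaped like this, the overall figure looks moderate while the per-bin list shows that every miss sits in the middle bin, which is exactly the information the average throws away.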
Beyond retrieval
Finding a fact is the easy case. Harder evals require synthesis across many positions, tracking entities through a long document, or answering only after combining scattered evidence. Performance usually drops as both context length and the number of required facts grow, so report results across both axes rather than a single headline number.
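Reporting across both axes can be sketched as a small grid aggregation. This assumes each trial is logged as a (context_length, num_facts, correct) triple; accuracy_grid and format_grid are hypothetical helper names, and the sample records are invented to illustrate the layout only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]  # (context_length, num_facts)


def accuracy_grid(records: Iterable[Tuple[int, int, bool]]) -> Dict[Cell, float]:
    """Map each (context_length, num_facts) cell to mean accuracy over its trials."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [hits, total]
    for context_length, num_facts, correct in records:
        cell = counts[(context_length, num_facts)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


def format_grid(grid: Dict[Cell, float]) -> str:
    """Render the grid as text: one row per context length, one column per fact count."""
    lengths = sorted({key[0] for key in grid})
    fact_counts = sorted({key[1] for key in grid})
    header = "length".ljust(10) + "".join(f"{n} facts".rjust(10) for n in fact_counts)
    rows = [header]
    for length in lengths:
        cells = []
        for n in fact_counts:
            value = grid.get((length, n))
            # Cells with no trials are shown as n/a rather than as zero.
            cells.append(("n/a" if value is None else f"{value:.2f}").rjust(10))
        rows.append(str(length).ljust(10) + "".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # Invented trial records, for layout illustration only.
    toy_records = [
        (1000, 1, True), (1000, 1, True),
        (1000, 4, True), (1000, 4, False),
        (8000, 1, True), (8000, 4, False),
    ]
    print(format_grid(accuracy_grid(toy_records)))
```

A table like this keeps the two effects separable: reading down a column isolates context length, and reading across a row isolates the number of required facts.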
Key idea
Long-context evaluation measures whether a model uses a long input rather than merely ingesting it. It probes retrieval and reasoning at varying depths, exposing the lost-in-the-middle dip that a single average score conceals.