The Long Context Eval

The claim versus the use

A large context window means a model can read many tokens, but reading is not the same as using. Long context evaluation tests whether the model can retrieve and reason over information buried deep inside a long input.

Core probes

Needle in a haystack: hide a fact at varying depths and ask the model to recover it.
Multi needle retrieval: require the model to recover several scattered facts at once.
Aggregation: demand reasoning that combines distant parts of the text.

Plotting accuracy against the position of the fact reveals the depths at which recall weakens.
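As a minimal sketch of how a needle in a haystack probe might be built, the snippet below places one fact at a chosen relative depth inside filler text and appends the question. The helper names (`insert_needle`, `build_probes`), the filler sentences, and the needle itself are all hypothetical, chosen only for illustration; a real eval would use natural distractor text and measure depth in tokens rather than sentences.

```python
def insert_needle(filler_sentences, needle, depth):
    """Return the filler text with `needle` placed at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    Depth is measured in sentences here, which is a simplification.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def build_probes(filler_sentences, needle, question, depths):
    """Build one prompt per depth so accuracy can later be plotted against depth."""
    return [
        {
            "depth": depth,
            "prompt": insert_needle(filler_sentences, needle, depth) + "\n\n" + question,
        }
        for depth in depths
    ]


# Toy inputs, invented for this example.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
probes = build_probes(filler, needle, "What is the secret code?", [0.0, 0.5, 1.0])
```

Each entry in `probes` would be sent to the model under test, and the answer scored against the needle. A multi needle variant could call `insert_needle` repeatedly with different facts and depths.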

The lost in the middle effect

Models often recall information near the beginning and end of the context well, but accuracy drops for information placed in the middle. A single average score hides this, so long context evals report accuracy as a function of depth, which exposes the dip.
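To make the point about averages concrete, here is a small sketch that groups per-example outcomes into depth buckets and compares them with the overall mean. The function name `accuracy_by_depth` is hypothetical, and the outcomes in `toy_results` are made up purely to illustrate the shape of the effect; they are not measurements from any model.

```python
from collections import defaultdict


def accuracy_by_depth(results, num_buckets=5):
    """Group (depth, correct) pairs into equal-width depth buckets.

    `depth` is a relative position in [0, 1]; bucket 0 is the start
    of the context and bucket num_buckets - 1 is the end.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for depth, correct in results:
        bucket = min(int(depth * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Invented outcomes: correct at the edges, wrong in the middle.
toy_results = [
    (0.05, True), (0.10, True),
    (0.30, True), (0.35, False),
    (0.50, False), (0.55, False),
    (0.70, True), (0.75, False),
    (0.90, True), (0.95, True),
]

per_depth = accuracy_by_depth(toy_results)
overall = sum(correct for _, correct in toy_results) / len(toy_results)

print("overall:", overall)
for bucket, accuracy in per_depth.items():
    print(f"bucket {bucket}: {accuracy:.2f}")
```

On this toy data the overall score is 0.6, while the middle bucket scores 0.0 and the two edge buckets score 1.0, which is the kind of dip a single average would conceal.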

Beyond retrieval

Finding a fact is the easy case. Harder evals require synthesis across many positions, tracking entities through a long document, or answering only after combining scattered evidence. Performance usually drops as both context length and the number of required facts grow, so report results across both axes rather than a single headline number.
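One way to report across both axes is a grid keyed by context length and number of required facts. The sketch below is an assumption about how such a report could be assembled, not a standard format: `all_facts_recovered`, `accuracy_grid`, and `format_grid` are hypothetical helpers, substring matching is a deliberately crude scoring rule, and the records are invented.

```python
from collections import defaultdict


def all_facts_recovered(answer, required_facts):
    """Strict scoring: the answer counts only if every required fact appears in it."""
    text = answer.lower()
    return all(fact.lower() in text for fact in required_facts)


def accuracy_grid(records):
    """Average correctness per (context_length, num_facts) cell."""
    cells = defaultdict(list)
    for record in records:
        cells[(record["context_length"], record["num_facts"])].append(record["correct"])
    return {key: sum(values) / len(values) for key, values in cells.items()}


def format_grid(grid):
    """Render the grid as plain text: one row per length, one column per fact count."""
    lengths = sorted({key[0] for key in grid})
    fact_counts = sorted({key[1] for key in grid})
    rows = ["length".rjust(8) + "".join(f"{n} facts".rjust(10) for n in fact_counts)]
    for length in lengths:
        cells = "".join(
            (f"{grid[(length, n)]:.2f}" if (length, n) in grid else "-").rjust(10)
            for n in fact_counts
        )
        rows.append(str(length).rjust(8) + cells)
    return "\n".join(rows)


# Invented records, for illustration only.
toy_records = [
    {"context_length": 4000, "num_facts": 1, "correct": True},
    {"context_length": 4000, "num_facts": 1, "correct": True},
    {"context_length": 4000, "num_facts": 4, "correct": True},
    {"context_length": 4000, "num_facts": 4, "correct": False},
    {"context_length": 64000, "num_facts": 1, "correct": True},
    {"context_length": 64000, "num_facts": 1, "correct": False},
    {"context_length": 64000, "num_facts": 4, "correct": False},
    {"context_length": 64000, "num_facts": 4, "correct": False},
]

grid = accuracy_grid(toy_records)
print(format_grid(grid))
```

Keeping each cell separate lets a reader see whether accuracy falls mainly with length, mainly with the number of facts, or with both together.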

Key idea

Long context evaluation measures whether a model uses, not just ingests, a long input by probing retrieval and reasoning at varying depths, exposing the lost in the middle dip that a single average score conceals.
