Machine Learning

The Long Context Eval

Testing whether a model truly uses a huge input or just skims the ends.

6 min read · advanced

The claim versus the use

A large context window means a model can read many tokens, but reading is not the same as using. Long-context evaluation tests whether the model can retrieve and reason over information buried deep inside a long input.

Core probes

  • Needle in a haystack: hide a single fact at varying depths and ask the model to recover it.
  • Multi-needle retrieval: scatter several facts through the input and require all of them at once.
  • Aggregation: pose a question whose answer requires combining distant parts of the text.

Plotting accuracy against the position of the hidden fact shows where recall weakens.
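A needle-in-a-haystack depth sweep can be sketched as below. This is a minimal illustration, not a reference harness: `insert_needle`, `run_depth_sweep`, and `skimming_model` are hypothetical names, the filler and needle text are invented, and the toy model only stands in for a real model call by looking at the first and last few words of its input.

```python
from typing import Callable, Dict, List


def insert_needle(filler: List[str], needle: str, depth: float) -> str:
    """Place `needle` at a fractional depth of the filler (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def run_depth_sweep(
    model: Callable[[str], str],
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """Ask the same question with the needle at each depth; record hit or miss."""
    results = {}
    for depth in depths:
        prompt = insert_needle(filler, needle, depth) + "\n\n" + question
        answer = model(prompt)
        # Substring match is a simplifying assumption; real evals often need stricter scoring.
        results[depth] = expected.lower() in answer.lower()
    return results


def skimming_model(prompt: str, window: int = 40) -> str:
    """Toy stand-in for a model that only 'sees' the two ends of its input."""
    words = prompt.split()
    visible = " ".join(words[:window] + words[-window:])
    return "7421" if "7421" in visible else "unknown"


# Invented haystack: 100 identical filler sentences and one hidden fact.
filler = ["The sky was grey that morning."] * 100
results = run_depth_sweep(
    model=skimming_model,
    filler=filler,
    needle="The vault code is 7421.",
    question="What is the vault code?",
    expected="7421",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
for depth, hit in results.items():
    print(f"depth {depth:.2f}: {'hit' if hit else 'miss'}")
```

With this toy model the sweep hits only at depths 0.0 and 1.0, which is the skim-the-ends behavior the probe is designed to catch. Swapping `skimming_model` for a function that calls a real model leaves the rest of the sweep unchanged.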

The lost-in-the-middle effect

Models often recall information at the beginning and end of the context well but degrade in the middle. A single average score hides this, so long-context evals report accuracy as a function of depth, which exposes the dip.
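To see how an average can mask the dip, the sketch below bins per-trial outcomes by depth and compares the per-bin accuracy with the overall score. The function name `accuracy_by_depth` is hypothetical and the trial records are invented toy data chosen to illustrate the shape, not measurements from any model.

```python
from typing import Iterable, List, Optional, Tuple


def accuracy_by_depth(
    records: Iterable[Tuple[float, bool]], n_bins: int = 5
) -> List[Optional[float]]:
    """Bucket (depth, correct) records into equal-width depth bins and score each bin."""
    hits = [0] * n_bins
    totals = [0] * n_bins
    for depth, correct in records:
        # A depth of exactly 1.0 falls into the last bin.
        b = min(int(depth * n_bins), n_bins - 1)
        hits[b] += int(correct)
        totals[b] += 1
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_bins)]


# Invented toy records: (fractional depth of the fact, whether the answer was correct).
records = (
    [(0.1, True)] * 4
    + [(0.3, True)] * 3 + [(0.3, False)]
    + [(0.5, True)] + [(0.5, False)] * 3
    + [(0.7, True)] * 3 + [(0.7, False)]
    + [(0.9, True)] * 4
)

overall = sum(correct for _, correct in records) / len(records)
per_bin = accuracy_by_depth(records)

print(f"overall accuracy: {overall:.2f}")
for b, acc in enumerate(per_bin):
    print(f"depth bin {b}: {acc:.2f}")
```

In this toy data the overall score is 0.75, which looks acceptable, while the middle bin sits at 0.25. Only the per-depth breakdown makes that gap visible.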

Beyond retrieval

Finding a fact is the easy case. Harder evals require synthesis across many positions, tracking entities through a long document, or answering only after combining scattered evidence. Performance usually drops as both context length and the number of required facts grow, so report results across both axes rather than a single headline number.
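Reporting across both axes can be as simple as a grid of accuracy keyed by context length and number of required facts. The sketch below is one possible layout under stated assumptions: `accuracy_grid`, `format_grid`, and `toy_trials` are hypothetical helpers, and the trial outcomes are invented to show the table shape rather than taken from any real evaluation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (context length in tokens, number of required facts, answered correctly)
Trial = Tuple[int, int, bool]


def accuracy_grid(trials: Iterable[Trial]) -> Dict[Tuple[int, int], float]:
    """Aggregate trials into accuracy per (context length, fact count) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [hits, total]
    for length, n_facts, correct in trials:
        cell = counts[(length, n_facts)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


def format_grid(grid: Dict[Tuple[int, int], float]) -> str:
    """Render the grid as text: rows are context lengths, columns are fact counts."""
    lengths = sorted({length for length, _ in grid})
    fact_counts = sorted({n for _, n in grid})
    lines = ["tokens".rjust(8) + "".join(f"{n:>5} fact(s)" for n in fact_counts)]
    for length in lengths:
        cells = "".join(
            f"{grid[(length, n)]:>13.2f}" if (length, n) in grid else " " * 10 + "n/a"
            for n in fact_counts
        )
        lines.append(f"{length:>8}" + cells)
    return "\n".join(lines)


def toy_trials(length: int, n_facts: int, hits: int, total: int) -> List[Trial]:
    """Build invented trial outcomes for one cell (illustration only)."""
    return [(length, n_facts, i < hits) for i in range(total)]


trials = (
    toy_trials(4000, 1, hits=4, total=4)
    + toy_trials(4000, 4, hits=3, total=4)
    + toy_trials(32000, 1, hits=3, total=4)
    + toy_trials(32000, 4, hits=1, total=4)
)
grid = accuracy_grid(trials)
print(format_grid(grid))
```

Keeping the cells separate shows whether accuracy falls with length, with fact count, or with both together, which a single pooled number cannot distinguish.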

Key idea

Long-context evaluation measures whether a model uses, not just ingests, a long input. It probes retrieval and reasoning at varying depths, exposing the lost-in-the-middle dip that a single average score conceals.

Check yourself

1. What does the needle-in-a-haystack probe test?

2. What is the lost-in-the-middle effect?

3. Why should long-context results be reported across both context length and fact count?