What they test
Reasoning benchmarks probe multi-step problem solving: grade-school and competition math, logic puzzles, and questions that need chained inference. Unlike recall tasks, these problems require working through intermediate steps rather than simply retrieving a fact.
Scoring the answer and the path
- Final-answer accuracy checks only the end result (often a boxed value) against the reference.
- Step or process scoring checks whether the intermediate reasoning is valid.
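- Chain-of-thought prompting often raises scores by giving the model room to think, and it exposes the steps that process scoring needs.
A model can reach the right answer through flawed steps, so process-aware scoring catches lucky guesses that final-answer accuracy would count as correct.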
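The gap between the two scoring styles can be sketched in a few lines. This is a minimal illustration, not a standard metric implementation: the `Sample` record, its `step_valid` field (per-step verdicts assumed to come from a human grader or an automated verifier), and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    predicted: str           # model's final answer
    gold: str                # reference answer
    step_valid: List[bool]   # hypothetical per-step verdicts from a grader


def final_answer_accuracy(samples: List[Sample]) -> float:
    """Fraction of samples whose final answer matches the reference."""
    correct = sum(s.predicted.strip() == s.gold.strip() for s in samples)
    return correct / len(samples)


def process_aware_accuracy(samples: List[Sample]) -> float:
    """Fraction with a matching final answer AND every step judged valid."""
    correct = sum(
        s.predicted.strip() == s.gold.strip() and all(s.step_valid)
        for s in samples
    )
    return correct / len(samples)


samples = [
    Sample("42", "42", [True, True, True]),
    Sample("42", "42", [True, False, True]),  # right answer, flawed step
    Sample("7", "9", [True, True]),           # valid-looking steps, wrong answer
    Sample("10", "10", [True]),
]

print(final_answer_accuracy(samples))   # 0.75
print(process_aware_accuracy(samples))  # 0.5
```

The second sample is the lucky guess: it counts toward final-answer accuracy but not toward the process-aware score.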
Where they break
Reasoning benchmarks are fragile in revealing ways:
- Small changes in numbers or wording can cause scores to drop sharply, which suggests pattern matching rather than genuine reasoning.
- Multiple-choice formats let a model score above zero by guessing, with no reasoning at all.
- Popular sets leak into training data, inflating results.
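To make the guessing problem concrete, one common adjustment rescales multiple-choice accuracy so that chance-level performance maps to zero. The sketch below assumes uniform random guessing over `num_options` equally likely choices; the function name is hypothetical, and it does not address the other two failure modes.

```python
def chance_corrected_accuracy(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing scores 0.0 and perfect scores 1.0.

    Assumes every question has `num_options` choices and that a guessing
    model picks uniformly at random.
    """
    if num_options < 2:
        raise ValueError("need at least two options per question")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


# On four-option questions, 25% raw accuracy is indistinguishable from guessing.
print(chance_corrected_accuracy(0.25, 4))   # 0.0
print(chance_corrected_accuracy(0.625, 4))  # 0.5
```

A raw score of 62.5% on four-option questions therefore represents only half of the distance between chance and perfect performance.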
Reading the results
Treat a single reasoning score with suspicion. Look for robustness across reworded variants, gains that survive when shortcuts are removed, and consistency between the stated steps and the final answer. Stable improvement across many fresh, perturbed problems is far more convincing than one headline number.
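A robustness check of this kind can be sketched as follows. Everything here is illustrative: the problem template, `make_variants`, and the two stand-in "models" are hypothetical, with `memorizing_model` mimicking a system that has only memorized the original item and `adding_model` mimicking one that actually computes the answer.

```python
import random
import re
from typing import Callable, List, Tuple

Problem = Tuple[str, int]  # (question text, gold answer)

TEMPLATE = "Ana has {a} apples and buys {b} more. How many apples does she have?"
ORIGINAL: List[Problem] = [(TEMPLATE.format(a=3, b=5), 8)]


def make_variants(n: int, seed: int = 0) -> List[Problem]:
    """Generate fresh variants of the same problem with different numbers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        variants.append((TEMPLATE.format(a=a, b=b), a + b))
    return variants


def accuracy(model: Callable[[str], int], problems: List[Problem]) -> float:
    """Fraction of problems the model answers correctly."""
    return sum(model(q) == gold for q, gold in problems) / len(problems)


def memorizing_model(question: str) -> int:
    """Stand-in for a model that memorized the original answer."""
    return 8


def adding_model(question: str) -> int:
    """Stand-in for a model that reads the numbers and adds them."""
    a, b = (int(x) for x in re.findall(r"\d+", question)[:2])
    return a + b


perturbed = make_variants(50)
for name, model in [("memorizing", memorizing_model), ("adding", adding_model)]:
    print(name, accuracy(model, ORIGINAL), accuracy(model, perturbed))
```

Both stand-ins score perfectly on the original item; only the one that computes the sum keeps its score on the perturbed set, which is the pattern to look for in real results.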
Key idea
Reasoning benchmarks measure multi-step inference, but because models can pattern-match, guess, or memorize, trustworthy results require process-aware scoring and robustness across perturbed, uncontaminated variants.