The Pairwise Comparison Eval

Why asking which of two answers is better beats absolute scoring for model quality.

Comparing instead of scoring

Asking a rater to assign an absolute score is hard because scales drift between people and over time: one rater's 7 can be another rater's 5. Pairwise comparison sidesteps this by showing two answers to the same prompt and asking which is better. Relative judgments tend to be more consistent, both across raters and across sessions.

From votes to rankings

Each comparison is a small contest. Collecting many of them across models lets you build a ranking. Systems often use an Elo-style rating, where each win nudges a model up and each loss nudges it down, with a larger adjustment for an upset than for an expected result. The outcome is a leaderboard built from a stream of head-to-head votes.
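The update rule can be sketched in a few lines. This is a minimal illustration of the standard Elo formula, not the method of any particular leaderboard; the model names, the starting rating of 1000, and the step size K = 32 are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0, initial=1000.0):
    """Apply one vote in place.

    outcome: 1.0 if A won, 0.0 if B won, 0.5 for a draw.
    k and initial are assumed values, not prescribed by the text.
    """
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    # The adjustment grows with how surprising the result was.
    delta = k * (outcome - expected_score(ra, rb))
    ratings[model_a] = ra + delta
    ratings[model_b] = rb - delta


# Hypothetical stream of head-to-head votes: (model A, model B, outcome for A).
votes = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 1.0),
    ("model_x", "model_z", 0.5),
    ("model_x", "model_y", 1.0),
]

ratings = {}
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)

# Highest rating first.
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each vote adds to one model exactly what it subtracts from the other, the total rating mass stays constant. Note that sequential Elo depends on the order in which votes arrive, so with few votes the ranking can shift if the same votes are replayed in a different order.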

Strengths

Lower cognitive load: choosing the better of two answers is easier than scoring one.
Calibration-free: no shared numeric scale is required.
Sensitive: it can separate models that absolute scores rate as tied.

Pitfalls

Ties and near-ties need handling, often by recording a draw or discarding the vote. The judge, whether human or model, can still suffer from position bias and verbosity bias. Randomizing the order in which the two answers are shown addresses position bias but not verbosity bias, which needs a separate check. Stable ratings require many comparisons, and a non-transitive set of votes, where A beats B, B beats C, and C beats A, signals noisy or inconsistent judging.
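Two of these safeguards are easy to sketch: shuffling presentation order and flagging non-transitive triples. The code below is an illustration under stated assumptions, not a reference implementation. The `judge` callable and its "first" / "second" / "tie" return values are a hypothetical interface, and `find_cycles` uses a simple majority rule to decide which model "beats" which.

```python
import random
from collections import Counter
from itertools import permutations


def judge_pair(judge, prompt, answer_a, answer_b, rng):
    """Show the two answers in random order and map the verdict back.

    `judge` is a hypothetical callable taking (prompt, first, second)
    and returning "first", "second", or "tie".
    Returns "a", "b", or "tie".
    """
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    first_won = verdict == "first"
    # When the pair was swapped, the first slot held answer B.
    return "b" if first_won == swapped else "a"


def find_cycles(wins):
    """Return triples (x, y, z) where x beats y, y beats z, and z beats x.

    `wins[(x, y)]` counts the votes in which x beat y; "beats" here means
    a strict majority of the votes between that pair.
    """
    models = sorted({m for pair in wins for m in pair})

    def beats(x, y):
        return wins.get((x, y), 0) > wins.get((y, x), 0)

    cycles = []
    for x, y, z in permutations(models, 3):
        # Report each cycle once, starting from its smallest member.
        if x < y and x < z and beats(x, y) and beats(y, z) and beats(z, x):
            cycles.append((x, y, z))
    return cycles


# A judge that always picks whatever is shown first (pure position bias).
def always_first(prompt, first, second):
    return "first"


rng = random.Random(0)
tally = Counter(
    judge_pair(always_first, "prompt", "answer A", "answer B", rng)
    for _ in range(1000)
)
print(tally)  # with shuffling, the biased judge's wins split between "a" and "b"

print(find_cycles({("A", "B"): 3, ("B", "C"): 2, ("C", "A"): 1}))
```

Shuffling does not remove position bias from any single vote; it only keeps the bias from systematically favoring one model. A stricter variant, offered here as a design option rather than something the text prescribes, is to ask the judge twice with the order flipped and count the vote only when both verdicts agree.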

Key idea

Pairwise comparison replaces fragile absolute scores with stable relative judgments that aggregate into Elo-style rankings, at the cost of needing many randomized comparisons and careful handling of ties and bias.
