Comparing instead of scoring
Asking a rater to assign an absolute score is hard because scales drift between raters and over time. Pairwise comparison sidesteps this by showing two answers to the same prompt and asking which is better. Relative judgments tend to be more consistent, because the rater needs no fixed reference point.
From votes to rankings
Each comparison is a small contest. Collecting many of them across models lets you build a ranking. Systems often use an Elo-style rating, where each win nudges a model up and each loss nudges it down, with a larger nudge when the result is more surprising given the current ratings. The result is a leaderboard built from a stream of head-to-head votes.
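The update behind such a leaderboard can be sketched as follows. This is a minimal illustration, assuming the standard Elo formula with a logistic expected score on a 400-point scale. The K-factor of 32, the starting rating of 1000, the function names, and the model labels are all assumptions chosen for the example, not the settings of any particular system.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(ratings: dict, a: str, b: str, outcome: float,
                   k: float = 32.0, initial: float = 1000.0) -> None:
    """Apply one vote in place.

    outcome is 1.0 if A won, 0.0 if B won, and 0.5 for a draw.
    k and initial are assumed values for illustration.
    """
    r_a = ratings.get(a, initial)
    r_b = ratings.get(b, initial)
    # The shift is proportional to how surprising the result was.
    delta = k * (outcome - expected_score(r_a, r_b))
    ratings[a] = r_a + delta
    ratings[b] = r_b - delta


# Hypothetical stream of votes: (model A, model B, outcome for A).
votes = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings = {}
for a, b, outcome in votes:
    update_ratings(ratings, a, b, outcome)

# Leaderboard: highest rating first.
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

With these assumed settings, a win between two equally rated models moves each rating by 16 points, and a draw between equals changes nothing. Because ratings depend on the order in which votes arrive, results from a short stream like this one are not stable.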
Strengths
- Lower cognitive load: choosing the better of two answers is easier than scoring one.
- Calibration-free: no shared numeric scale is required.
- Sensitive: it can separate models that absolute scores rate as tied.
Pitfalls
Ties and near-ties need handling, often by recording a draw or discarding the vote. The judge, whether human or model, can still suffer from position bias and verbosity bias. Randomizing which answer appears first counters position bias, but it does not remove verbosity bias. Many comparisons are needed for stable ratings. A non-transitive set of votes, where A beats B, B beats C, and C beats A, can signal noisy or inconsistent judging.
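Two of these safeguards can be sketched in code: randomizing presentation order and flagging non-transitive triples. This is a rough illustration under stated assumptions, not a prescribed procedure. The helper names `present`, `unswap`, and `find_cycles`, the "left"/"right"/"tie" vote labels, and the representation of results as a set of (winner, loser) pairs are all hypothetical.

```python
import random
from itertools import combinations


def present(answer_a: str, answer_b: str, rng: random.Random):
    """Return (left, right, swapped) with the display order randomized."""
    swapped = rng.random() < 0.5
    if swapped:
        return answer_b, answer_a, True
    return answer_a, answer_b, False


def unswap(choice: str, swapped: bool) -> str:
    """Map the judge's 'left'/'right'/'tie' vote back to 'a'/'b'/'tie'."""
    if choice == "tie":
        return "tie"  # recorded as a draw; discarding is the other option
    picked_left = choice == "left"
    # Left is answer A unless the pair was swapped.
    return "b" if picked_left == swapped else "a"


def find_cycles(wins: set) -> list:
    """Return triples (x, y, z) whose pairwise results form a cycle.

    wins holds (winner, loser) pairs, for example majority outcomes.
    """
    models = sorted({m for pair in wins for m in pair})
    cycles = []
    for x, y, z in combinations(models, 3):
        forward = {(x, y), (y, z), (z, x)} <= wins
        backward = {(x, z), (z, y), (y, x)} <= wins
        if forward or backward:
            cycles.append((x, y, z))
    return cycles


rng = random.Random(0)  # seeded only to make the example repeatable
left, right, swapped = present("answer from A", "answer from B", rng)
vote = unswap("left", swapped)  # suppose the judge picked the left answer

cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
print(find_cycles(cyclic))
```

Storing the `swapped` flag alongside each vote also makes it possible to check afterwards whether the left or right position wins more often than chance would suggest.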
Key idea
Pairwise comparison replaces fragile absolute scores with more stable relative judgments that aggregate into Elo-style rankings, at the cost of needing many randomized comparisons and careful handling of ties and bias.