Calibration and the Brier Score

Making predicted probabilities mean what they say, and measuring how well they do.

Probabilities you can trust

A model is calibrated if, among all cases where it outputs 0.7, about 70 percent are actually positive, and likewise at every other probability level. Calibration is separate from ranking: a model can rank perfectly yet output badly scaled probabilities.

Why it matters

Downstream decisions multiply probabilities by costs. An expected value calculation only works if the probability is honest.
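
A minimal sketch of such a decision rule, assuming a simple act-or-wait setup. The function name and all numbers are hypothetical, chosen only to show how an inflated probability flips the decision.

```python
def should_act(p_event: float, loss_if_missed: float, cost_of_action: float) -> bool:
    """Act when the expected loss from doing nothing exceeds the cost of acting."""
    return p_event * loss_if_missed > cost_of_action


# Hypothetical costs, for illustration only.
loss_if_missed = 1000.0
cost_of_action = 400.0

# Suppose the true event rate is 0.3 but an overconfident model reports 0.6.
decision_with_honest_p = should_act(0.3, loss_if_missed, cost_of_action)
decision_with_inflated_p = should_act(0.6, loss_if_missed, cost_of_action)

print(decision_with_honest_p, decision_with_inflated_p)
```

With the honest probability the expected loss (300) stays below the cost of acting (400); with the inflated one (600) it does not, so the same rule reaches the opposite decision.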

Measuring calibration

A reliability diagram bins predictions and plots predicted versus observed frequency; the diagonal is perfect calibration.
Expected calibration error averages the gap between predicted and observed frequency across bins, weighting each bin by the number of predictions it holds.
The Brier score is the mean squared error between the predicted probability and the 0 or 1 outcome; lower is better.

The Brier score rewards both calibration and sharpness, so it captures overall probabilistic quality in one number.
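
The three measures above can be sketched in plain Python. This is an illustrative version under stated assumptions: equal-width bins on [0, 1], ten bins by default, and function names invented for this example. The toy data is made up.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def reliability_bins(probs, outcomes, n_bins=10):
    """Return (mean predicted, observed frequency, count) for each non-empty bin.

    These are the points a reliability diagram would plot.
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Average |predicted - observed| over bins, weighted by bin size."""
    total = len(probs)
    return sum(
        (n / total) * abs(mean_p - freq)
        for mean_p, freq, n in reliability_bins(probs, outcomes, n_bins)
    )


# Toy data (made up): confident and correct, but not perfectly calibrated.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]

brier = brier_score(probs, outcomes)
ece = expected_calibration_error(probs, outcomes)
bins = reliability_bins(probs, outcomes)
print(brier, ece, bins)
```

The bin count is a design choice: fewer bins give stabler frequency estimates, more bins give finer resolution, and the ECE value changes with that choice.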

Fixing miscalibration

Platt scaling fits a logistic function to the model's scores on a held-out validation set.
Isotonic regression fits a flexible monotonic mapping; it needs more data to avoid overfitting.
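
A rough sketch of both methods in plain Python, under several assumptions: Platt scaling is fitted here by plain gradient descent on log loss, without the target smoothing used in Platt's original method; isotonic regression uses the pool-adjacent-violators algorithm and predicts with a step function; scores are assumed distinct. Function names, the learning rate, the step count, and the validation data are all hypothetical.

```python
import math


def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(a, b, score):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns [(upper score, value)] with non-decreasing values."""
    blocks = []  # each block: [sum of labels, count, highest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def isotonic_predict(step_table, score):
    """Step-function lookup: value of the first block whose upper score covers `score`."""
    for top, value in step_table:
        if score <= top:
            return value
    return step_table[-1][1]


# Hypothetical validation set: raw model scores and true labels.
val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
val_labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(val_scores, val_labels)
step_table = fit_isotonic(val_scores, val_labels)
print(platt_predict(a, b, 0.35), isotonic_predict(step_table, 0.35))
```

The contrast is visible in the outputs: Platt scaling has two parameters and always yields a smooth S-shaped curve, while the isotonic fit has as many levels as the data supports, which is why it tends to need more validation examples.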

Key idea