Probabilities you can trust
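A model is calibrated if its stated probabilities match observed frequencies: among all predictions of 0.7, about 70 percent turn out positive, and the same holds at every other probability level. Calibration is separate from ranking: a model can rank every positive above every negative and still output badly scaled probabilities.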
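A small sketch of the ranking-versus-calibration distinction, using made-up labels and scores (the data and the helper name `auc` are illustrative assumptions, not taken from any particular model):

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 3 positives out of 10, so the base rate is 0.3.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# A hypothetical model that says 0.8 for every negative and 0.9 for every positive.
squashed = [0.9 if y == 1 else 0.8 for y in labels]

ranking_quality = auc(labels, squashed)            # every positive outranks every negative
mean_prediction = sum(squashed) / len(squashed)    # about 0.83
base_rate = sum(labels) / len(labels)              # 0.3

# Observed positive rate among the predictions of 0.8: none of them are positive.
rate_at_08 = sum(y for y, p in zip(labels, squashed) if p == 0.8) / squashed.count(0.8)

print(ranking_quality, mean_prediction, base_rate, rate_at_08)
```

The ranking is perfect, yet the model claims roughly 83 percent positives on data where only 30 percent are positive.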
Why it matters
Downstream decisions weight costs and payoffs by predicted probabilities. An expected value calculation is only correct if those probabilities match the real frequencies.
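To make this concrete, here is a minimal sketch with invented numbers: a hypothetical action that gains 100 when the outcome is positive and loses 100 when it is negative, a model that states 0.7, and a true positive rate of 0.4. All values and the function name `expected_value` are assumptions chosen for illustration.

```python
def expected_value(p, gain_if_positive, loss_if_negative):
    """Expected payoff of acting, given probability p of a positive outcome."""
    return p * gain_if_positive - (1 - p) * loss_if_negative

gain, loss = 100.0, 100.0   # hypothetical payoffs
stated_p = 0.7              # what the model claims
true_rate = 0.4             # how often such cases are actually positive

planned = expected_value(stated_p, gain, loss)    # roughly +40: looks worth doing
realized = expected_value(true_rate, gain, loss)  # roughly -20: actually loses money

print(planned, realized)
```

The decision rule itself is fine; the overconfident probability is what flips the sign of the outcome.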
Measuring calibration
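- A reliability diagram bins predictions and plots the mean predicted probability against the observed frequency in each bin. Points on the diagonal indicate perfect calibration.
- Expected calibration error (ECE) averages the absolute gap between predicted and observed frequency across bins, weighting each bin by the number of predictions it holds.
- The Brier score is the mean squared error between the predicted probability and the 0 or 1 outcome. Lower is better.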
The Brier score rewards both calibration and sharpness (predictions that commit toward 0 or 1 rather than hovering near the base rate), so it summarizes overall probabilistic quality in one number.
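The three measures can be sketched in a few lines of plain Python. This version assumes equal-width bins and a count-weighted ECE; the function names and the toy data are illustrative choices, not a reference implementation.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Return (count, mean_predicted, observed_rate) for each non-empty equal-width bin."""
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[i][0] += 1
        sums[i][1] += p
        sums[i][2] += y
    return [(n, sp / n, sy / n) for n, sp, sy in sums if n > 0]

def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted average of |mean predicted - observed rate| over bins."""
    bins = reliability_bins(probs, labels, n_bins)
    total = sum(n for n, _, _ in bins)
    return sum(n / total * abs(mean_p - rate) for n, mean_p, rate in bins)

def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Toy outcomes: 1 of the first 5 is positive, 4 of the last 5 are positive.
labels = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
sharp = [0.2] * 5 + [0.8] * 5   # calibrated and sharp
flat = [0.5] * 10               # calibrated but uninformative

for name, probs in [("sharp", sharp), ("flat", flat)]:
    print(name, expected_calibration_error(probs, labels), brier_score(probs, labels))
```

Both forecasts are calibrated on this data, so ECE is about 0 for each, but the sharper one earns the lower Brier score (about 0.16 versus 0.25). The points for a reliability diagram are the `(mean_predicted, observed_rate)` pairs from `reliability_bins`.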
Fixing miscalibration
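- Platt scaling fits a logistic (sigmoid) function that maps the model's raw scores to probabilities, using a held-out validation set
- Isotonic regression fits a more flexible, non-decreasing mapping, and needs more data than Platt scaling to avoid overfitting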
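Both fixes can be sketched without any libraries. The code below is a simplified illustration under stated assumptions: Platt scaling is fit by plain gradient descent on log loss (the original method also smooths the targets, which is omitted here), and isotonic regression uses the pool-adjacent-violators procedure without special handling of tied scores. The function names, learning rate, step count, and validation data are all hypothetical.

```python
import math

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_predict(a, b, score):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns (upper_score, value) steps, non-decreasing in value."""
    blocks = []  # each block: [sum_of_labels, count, upper_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def predict_isotonic(steps, score):
    """Piecewise-constant lookup: value of the first step covering the score."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]

# Hypothetical held-out validation scores (uncalibrated) and their outcomes.
val_scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
val_labels = [0, 0, 1, 0, 0, 1, 1, 1]

a, b = fit_platt(val_scores, val_labels)
steps = fit_isotonic(val_scores, val_labels)
print(platt_predict(a, b, 0.0), predict_isotonic(steps, 0.0))
```

Platt scaling learns only two parameters, which is why it holds up on small validation sets. The isotonic fit learns one value per block, so it can follow any monotone shape but overfits more easily when data is scarce.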
Key idea
Calibration means predicted probabilities match observed frequencies. The Brier score grades them, and Platt scaling or isotonic regression repairs them.