Machine Learning

Calibration and the Brier Score

Making predicted probabilities mean what they say, and measuring how well they do.

5 min read · advanced

Probabilities you can trust

A model is calibrated if its probabilities match observed frequencies: among all the cases where it outputs 0.7, about 70 percent should turn out positive. Calibration is separate from ranking: a model can rank every positive above every negative yet still output badly scaled probabilities.
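To make the ranking versus calibration distinction concrete, here is a minimal sketch on made-up data. The `auc` helper and the cubing distortion are illustrative assumptions, not part of the lesson: cubing is just one monotonic transform that keeps the order of the scores while shrinking their values.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): every positive is scored above every negative.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# A monotonic distortion keeps the ordering but rescales the probabilities.
distorted = [s ** 3 for s in scores]

base_rate = sum(labels) / len(labels)
mean_original = sum(scores) / len(scores)
mean_distorted = sum(distorted) / len(distorted)

print("AUC original / distorted:", auc(labels, scores), auc(labels, distorted))
print("base rate, mean original, mean distorted:",
      base_rate, mean_original, mean_distorted)
```

Both score lists rank the items identically, but the distorted scores average well below the observed positive rate, so they can no longer be read as frequencies.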

Why it matters

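Downstream decisions weigh probabilities against costs, typically by multiplying the two to get an expected value. That calculation is only as good as the probability fed into it: if a reported 0.7 really behaves like 0.4, the expected value is wrong and so is the decision.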
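A small sketch of how a dishonest probability can flip a decision. The function names and the cost figures (a loss of 100 if a positive case is ignored, a cost of 10 to act) are hypothetical values chosen only for illustration.

```python
def expected_loss_if_ignored(p, loss_if_positive):
    """Expected loss from doing nothing: probability times the cost of a miss."""
    return p * loss_if_positive

def should_act(p, loss_if_positive, cost_of_acting):
    """Act when the expected loss of ignoring exceeds the cost of acting."""
    return expected_loss_if_ignored(p, loss_if_positive) > cost_of_acting

# Hypothetical costs: acting costs 10, a missed positive costs 100,
# so acting pays off whenever the true probability exceeds 0.1.
LOSS_IF_POSITIVE = 100.0
COST_OF_ACTING = 10.0

honest_p = 0.20          # matches the observed frequency
underconfident_p = 0.05  # same cases, but the model reports too low a number

print(should_act(honest_p, LOSS_IF_POSITIVE, COST_OF_ACTING))
print(should_act(underconfident_p, LOSS_IF_POSITIVE, COST_OF_ACTING))
```

With the honest probability the rule says to act; with the miscalibrated one it says not to, even though the underlying cases are the same.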

Measuring calibration

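  • A reliability diagram bins predictions and plots predicted probability against observed frequency. Points on the diagonal indicate perfect calibration
  • Expected calibration error (ECE) averages the absolute gap between predicted and observed frequency across bins, weighting each bin by its number of predictions
  • The Brier score is the mean squared error between the predicted probability and the 0 or 1 outcome. Lower is better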
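The binning behind a reliability diagram and the expected calibration error can be sketched in a few lines. This assumes equal-width bins on [0, 1] and weights each bin by its share of the predictions, which is one common convention; the function names and toy data are illustrative.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Return (mean predicted, observed frequency, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for items in bins:
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            freq = sum(y for _, y in items) / len(items)
            rows.append((mean_p, freq, len(items)))
    return rows

def expected_calibration_error(probs, labels, n_bins=10):
    """Average |predicted - observed| over bins, weighted by bin size."""
    n = len(probs)
    return sum(count / n * abs(mean_p - freq)
               for mean_p, freq, count in reliability_bins(probs, labels, n_bins))

# Toy data: predictions of 0.2 are right 1 time in 5, predictions of 0.8 4 times in 5.
calibrated_probs = [0.2] * 5 + [0.8] * 5
calibrated_labels = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Toy data: the model always says 0.9 but only half the cases are positive.
overconfident_probs = [0.9] * 10
overconfident_labels = [1] * 5 + [0] * 5

ece_calibrated = expected_calibration_error(calibrated_probs, calibrated_labels)
ece_overconfident = expected_calibration_error(overconfident_probs, overconfident_labels)
print(ece_calibrated, ece_overconfident)
```

Plotting the first two entries of each row from `reliability_bins` against each other gives the reliability diagram; the first toy model sits on the diagonal, while the second sits about 0.4 below it.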

The Brier score rewards both calibration and sharpness (confident predictions close to 0 or 1), so it captures overall probabilistic quality in one number.
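As a rough illustration of how the Brier score responds to sharpness, the sketch below compares two forecasters on made-up data. Both are calibrated on this toy set, but one always predicts the base rate while the other commits to 0.2 or 0.8. The names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Toy outcomes: 10 items where 2 are positive, then 10 items where 8 are positive.
labels = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2

# Sharp and calibrated: says 0.2 for the first group, 0.8 for the second.
sharp_probs = [0.2] * 10 + [0.8] * 10

# Calibrated but not sharp: always predicts the overall base rate of 0.5.
base_rate_probs = [0.5] * 20

brier_sharp = brier_score(sharp_probs, labels)
brier_base_rate = brier_score(base_rate_probs, labels)
print(brier_sharp, brier_base_rate)
```

The sharper forecaster scores about 0.16 against 0.25 for the base-rate forecaster, even though both match observed frequencies.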

Fixing miscalibration

  • Platt scaling fits a logistic (sigmoid) function to the model's scores on a held-out validation set
  • Isotonic regression fits a more flexible monotonic mapping, but needs more data to avoid overfitting
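Both repairs can be sketched without any library. The version below is simplified and rests on stated assumptions: the Platt fit uses plain gradient descent on log loss (the original method also smooths the targets and uses a different optimizer), and the isotonic fit is a basic pool-adjacent-violators pass that assumes distinct scores. All names and data are illustrative; in practice a tested library implementation is the safer choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def fit_isotonic(scores, labels):
    """Pool adjacent violators: non-decreasing fit of labels ordered by score."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum of labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted

# Hypothetical validation set of raw model scores (logits) and outcomes.
val_scores = [-2, -1, -1, 0, 0, 1, 1, 2]
val_labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(val_scores, val_labels)
print("Platt: p =", sigmoid(a * 1.5 + b), "for a raw score of 1.5")

# Hypothetical validation set of uncalibrated probabilities and outcomes.
iso_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
iso_labels = [0, 1, 0, 1, 1, 1]
sorted_scores, fitted = fit_isotonic(iso_scores, iso_labels)
print("Isotonic:", list(zip(sorted_scores, fitted)))
```

The Platt fit has only two parameters, so it is stable on small validation sets but can only produce a sigmoid shape. The isotonic fit is a step function with as many levels as the data support, which is why it tends to need more validation data.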

Key idea

Calibration means predicted probabilities match observed frequencies. The Brier score grades them, and Platt scaling or isotonic regression repairs them.

Check yourself

1. A calibrated model outputs 0.3 for a group of items. What should hold?

2. What does the Brier score measure?

3. Which method repairs miscalibration with a flexible monotonic mapping?