What calibration means
A model is calibrated when its stated confidence matches how often it is actually right. Among all predictions made with seventy percent confidence, about seventy percent should be correct. Calibration is distinct from accuracy: a model can be highly accurate yet poorly calibrated, for example reporting ninety-nine percent confidence when it is right only eighty percent of the time.
Why it matters
Probabilities feed downstream decisions, risk thresholds, and human trust. Overconfident outputs lead to bad automated choices and understated risk. Modern deep networks tend to be overconfident by default.
Measuring it
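- A reliability diagram groups predictions into confidence bins and plots each bin's average confidence against its observed accuracy
- A perfectly calibrated model lies on the diagonal of that diagram
- The expected calibration error (ECE) averages the gap between confidence and accuracy across bins, weighting each bin by the number of predictions it contains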
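As a rough illustration of the measurements above, the sketch below computes an expected calibration error with equal-width bins and returns the per-bin points that a reliability diagram would plot. The function name, the choice of ten bins, and the toy data are assumptions made for this example, not part of any particular library.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Return (ece, bins) using equal-width confidence bins.

    confidences: top-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Accumulate count, summed confidence, and number correct per bin.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(confidences, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        conf_sums[b] += p
        hits[b] += int(ok)

    total = len(confidences)
    ece = 0.0
    bins = []  # (mean confidence, accuracy, count) for each non-empty bin
    for n, s, h in zip(counts, conf_sums, hits):
        if n == 0:
            continue
        mean_conf, acc = s / n, h / n
        ece += (n / total) * abs(acc - mean_conf)  # gap weighted by bin size
        bins.append((mean_conf, acc, n))
    return ece, bins


# Toy data (made up): ten predictions at 0.9 confidence, seven of them right.
confs = [0.9] * 10
right = [True] * 7 + [False] * 3
ece, bins = expected_calibration_error(confs, right)
print(round(ece, 3), bins)
```

Each tuple in `bins` is one point on the reliability diagram. Here the single point sits at roughly (0.9, 0.7), below the diagonal, which is what overconfidence looks like.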
Fixing it
Calibration is usually a cheap post-processing step fit on held-out data.
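- Temperature scaling divides the logits by a single learned temperature before the softmax; a temperature above one softens the probabilities and a temperature below one sharpens them, in neither case changing the predicted class
- Platt scaling fits a logistic (sigmoid) transform to the scores, learning a slope and an offset
- Isotonic regression fits a flexible, non-parametric monotonic mapping from scores to probabilities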
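The last two methods in the list can be sketched for a binary classifier as follows. This is a minimal illustration, not a reference implementation: `fit_platt` uses plain gradient descent on the log loss and omits the refinements that library implementations usually add, and `fit_isotonic` uses the pool-adjacent-violators procedure and assumes the scores are distinct. All function names, the learning rate, the step count, and the data are assumptions made for the example.

```python
import bisect
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns (upper_bounds, values) of a step function."""
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def predict_isotonic(upper_bounds, values, score):
    """Look up the step that covers `score`, clamping beyond the last step."""
    i = bisect.bisect_left(upper_bounds, score)
    return values[min(i, len(values) - 1)]


# Held-out scores and 0/1 labels (made up for illustration).
platt_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
platt_labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(platt_scores, platt_labels)

iso_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
iso_labels = [0, 1, 0, 1, 1, 1]
upper_bounds, values = fit_isotonic(iso_scores, iso_labels)
print(a, b, values)
```

Platt scaling has only two parameters, so it is hard to overfit but can only produce a sigmoid-shaped correction. The isotonic fit can take any non-decreasing shape, which generally calls for more held-out data.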
Temperature scaling is the most popular because it is simple, leaves accuracy untouched, and only needs one parameter tuned on a validation set.
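A minimal sketch of temperature scaling is shown below, assuming the validation logits and labels are available as plain Python lists. It picks the temperature that minimizes the negative log-likelihood using a simple grid search; gradient-based optimizers are a common alternative. The function names, the grid range, and the toy logits are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_of_label(logits, label, temperature):
    """Log of the softmax probability of `label`, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return scaled[label] - log_norm


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at this temperature."""
    total = sum(
        -log_prob_of_label(row, y, temperature)
        for row, y in zip(logit_rows, labels)
    )
    return total / len(labels)


def fit_temperature(logit_rows, labels):
    """Grid-search the single temperature that minimizes validation NLL."""
    candidates = [0.05 * k for k in range(1, 201)]  # 0.05 up to 10.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Toy validation set (made up): very confident logits, but only 4 of 6 correct.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0],
              [0.0, 4.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1, 1, 1]

T = fit_temperature(val_logits, val_labels)
before = max(softmax(val_logits[0]))
after = max(softmax(val_logits[0], T))
print(T, before, after)
```

On this toy set the fitted temperature is well above one, so the top-class probability falls from about 0.98 toward the observed accuracy of two thirds. Dividing every logit by the same positive number never reorders them, which is why the predicted class, and therefore accuracy, stays the same.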
Key idea
Calibration aligns predicted confidence with real accuracy, often via cheap post hoc methods like temperature scaling.