The overconfidence problem
With one-hot targets, the cross-entropy loss keeps decreasing as the correct logit grows relative to the rest, so training pushes that gap toward infinity. The model becomes overconfident and poorly calibrated, assigning near-certainty even when it is wrong.
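A small numerical sketch of this effect, assuming a simplified setting that is not in the original text: 10 classes, all wrong logits equal, and the correct logit larger by `gap`. The function name `one_hot_cross_entropy` is hypothetical.

```python
import math

def one_hot_cross_entropy(gap, num_classes=10):
    """Cross-entropy against a one-hot target when the correct logit
    exceeds every other logit by `gap` (all wrong logits assumed equal)."""
    # Softmax probability of the correct class is 1 / (1 + (K - 1) * exp(-gap)),
    # so the loss -log(p) equals log(1 + (K - 1) * exp(-gap)).
    return math.log1p((num_classes - 1) * math.exp(-gap))

gaps = [0.0, 2.0, 5.0, 10.0, 20.0]
losses = [one_hot_cross_entropy(g) for g in gaps]
for g, loss in zip(gaps, losses):
    print(f"gap={g:5.1f}  loss={loss:.8f}")
```

The loss shrinks at every step but never reaches zero, so under these assumptions the optimizer always has an incentive to widen the gap further.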
What smoothing does
Label smoothing replaces the hard 1 with a slightly lower value and spreads the remaining mass evenly across the other classes. A common setting reserves a tenth of the probability for the wrong classes. Because the target is no longer extreme, the loss is minimized at a finite logit gap and the logits stay bounded.
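To make the paragraph above concrete, here is a minimal sketch in plain Python. The helper name `smooth_targets` and its parameters are illustrative assumptions, and the function follows the description literally: the true class keeps 1 - epsilon and the rest is shared among the other classes.

```python
def smooth_targets(label, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for the integer class `label`.

    The true class gets 1 - epsilon; each of the other K - 1 classes
    gets epsilon / (K - 1), so the entries still sum to 1.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_value = epsilon / (num_classes - 1)
    target = [off_value] * num_classes
    target[label] = 1.0 - epsilon
    return target

# Example: 5 classes, true class at index 2, a tenth of the mass reserved.
target = smooth_targets(label=2, num_classes=5, epsilon=0.1)
print(target)
```

With these inputs the true class holds about 0.9 and each of the four other classes about 0.025.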
The transformation
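Written out, the smoothing described in the previous section might look as follows. The symbols are assumptions introduced here for illustration: K is the number of classes, c the true class, ε the smoothing value, and y_k^smooth the smoothed target for class k.

```latex
y_k^{\text{smooth}} =
\begin{cases}
1 - \varepsilon & \text{if } k = c, \\[4pt]
\dfrac{\varepsilon}{K - 1} & \text{if } k \neq c.
\end{cases}
```

A widely used variant instead mixes the one-hot vector with a uniform distribution over all K classes, giving (1 - ε) + ε/K for the true class and ε/K for every other class. The two forms differ only slightly when K is large.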
Why it helps
- Predictions become better calibrated, so confidence tracks accuracy.
- The network is no longer rewarded for growing its logits without limit, which acts as a regularizer and often improves generalization.
- It tightens the clustering of representations within a class.
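The point about logits in the list above can be checked numerically. This sketch assumes the same simplified setting as before, which is not from the original text: all wrong logits are equal, the correct logit is larger by `gap`, and each wrong class receives epsilon / (K - 1). Under those assumptions the smoothed loss reduces to log(1 + (K - 1) e^(-gap)) + epsilon * gap, which has its minimum at gap = log((K - 1)(1 - epsilon) / epsilon). The function names are hypothetical.

```python
import math

def smoothed_cross_entropy(gap, num_classes=10, epsilon=0.1):
    """Cross-entropy against smoothed targets when the correct logit
    exceeds every other logit by `gap` (all wrong logits assumed equal)."""
    # The first term is the one-hot loss; the second term penalizes large gaps.
    return math.log1p((num_classes - 1) * math.exp(-gap)) + epsilon * gap

def optimal_gap(num_classes=10, epsilon=0.1):
    """Logit gap at which the derivative of the smoothed loss is zero."""
    return math.log((num_classes - 1) * (1.0 - epsilon) / epsilon)

best = optimal_gap()
candidates = [best - 2.0, best - 0.5, best, best + 0.5, best + 2.0]
values = [smoothed_cross_entropy(g) for g in candidates]
for g, v in zip(candidates, values):
    print(f"gap={g:6.3f}  loss={v:.6f}")
```

With 10 classes and epsilon = 0.1 the best gap is log(81), roughly 4.39. Pushing the gap beyond that raises the loss, which is why the model has no reason to chase ever-larger logits.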
Practical notes
- A smoothing value around 0.1 is a common default.
- It can hurt if the model is later used as a teacher for distillation, because smoothing erases much of the information about inter-class similarity that the teacher's logits would otherwise carry.
- It pairs well with mixup, which also produces soft labels.
Key idea
Label smoothing softens one-hot targets by reserving a little probability for the other classes. This bounds the logits, improves calibration, and aids generalization at the cost of slightly fuzzier targets.