Label Smoothing

The overconfidence problem

With one-hot targets, cross-entropy keeps rewarding a larger gap between the correct logit and the rest, so the loss reaches zero only as that gap goes to infinity. The model becomes overconfident and poorly calibrated, assigning near-certain probabilities even when it is wrong.
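
The effect can be seen numerically with a minimal sketch. It assumes a three-class problem in which the two wrong logits are held at zero; the helper names `softmax` and `cross_entropy` are illustrative, not taken from any library.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    # Cross-entropy between a target distribution and softmax(logits).
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

one_hot = [1.0, 0.0, 0.0]  # class 0 is correct

# Grow the correct logit while the others stay at 0 (an assumed toy setup).
gaps = [1.0, 5.0, 10.0, 20.0]
losses = [cross_entropy(one_hot, [gap, 0.0, 0.0]) for gap in gaps]

for gap, loss in zip(gaps, losses):
    print(f"gap={gap:5.1f}  loss={loss:.3e}")
```

Each increase in the gap lowers the loss a little more, and the loss never reaches zero, so gradient descent always has a reason to push the correct logit further.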

What smoothing does

Label smoothing replaces the hard target of 1 with a slightly lower value and spreads the remaining probability mass evenly across the other classes. A common setting reserves a tenth of the probability (a smoothing value of 0.1) for the wrong classes. Because the target is no longer at the extreme, the loss is minimized at a finite logit gap, so the logits stay bounded.
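
A minimal sketch of this idea follows. It assumes the variant described above, where the reserved mass goes only to the other classes; `smooth_labels` and `cross_entropy` are hypothetical helper names, and the three-class setup with wrong logits fixed at zero is a toy assumption.

```python
import math

def smooth_labels(true_class, num_classes, epsilon=0.1):
    # Build a smoothed target: 1 - epsilon on the true class,
    # epsilon / (num_classes - 1) on each of the other classes.
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    target = [epsilon / (num_classes - 1)] * num_classes
    target[true_class] = 1.0 - epsilon
    return target

def cross_entropy(target, logits):
    # Cross-entropy against softmax(logits), via a stable log-sum-exp.
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_total) for t, z in zip(target, logits))

target = smooth_labels(true_class=0, num_classes=3, epsilon=0.1)

# With epsilon = 0.1 and 3 classes, softmax matches the target when
# exp(gap) = 0.9 / 0.05 = 18, so the loss should bottom out at log(18).
gaps = [1.0, 2.0, math.log(18.0), 5.0, 10.0]
losses = [cross_entropy(target, [gap, 0.0, 0.0]) for gap in gaps]
best_gap = gaps[losses.index(min(losses))]

for gap, loss in zip(gaps, losses):
    print(f"gap={gap:6.3f}  loss={loss:.4f}")
```

Unlike the one-hot case, the loss rises again once the gap passes its optimum, so there is no incentive to grow the logits without limit.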

The transformation
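
Written out, the smoothing described in these notes corresponds to the following target. The symbols are notation introduced here as an assumption: K is the number of classes, c the correct class, and ε the smoothing value.

```latex
\[
y^{\mathrm{LS}}_k =
\begin{cases}
1 - \varepsilon & \text{if } k = c, \\[4pt]
\dfrac{\varepsilon}{K - 1} & \text{if } k \neq c,
\end{cases}
\qquad k = 1, \dots, K .
\]

% Cross-entropy is minimized when the softmax output equals the target,
% which fixes the gap between the correct logit and every other logit:
\[
z_c - z_k = \log \frac{(1 - \varepsilon)(K - 1)}{\varepsilon},
\qquad k \neq c .
\]
```

For ε = 0.1 and K = 10 the optimal gap is log 81, roughly 4.39, which is finite. Some implementations instead mix the one-hot vector with a uniform distribution over all K classes, which gives 1 - ε + ε/K for the correct class and ε/K for each of the others.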

Why it helps

Predictions tend to be better calibrated, so stated confidence tracks accuracy more closely.
The network is discouraged from chasing infinite logits, which often improves generalization.
Representations of examples from the same class tend to cluster more tightly.
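
One common way to quantify whether confidence tracks accuracy is expected calibration error (ECE). The sketch below is a minimal version that assumes equal-width confidence bins; the function name and the toy inputs are illustrative and are not measurements from any model.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    # Group predictions into equal-width confidence bins, then average
    # |mean confidence - accuracy| over bins, weighted by bin size.
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Made-up toy predictions: same idea, different confidence levels.
overconfident = expected_calibration_error(
    [0.99, 0.99, 0.99, 0.99], [True, True, False, False]
)
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
print(overconfident, calibrated)
```

In the first toy case the model claims 99% confidence but is right half the time, giving a large error; in the second, 75% confidence matches 75% accuracy and the error is zero.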

Practical notes

A smoothing value around 0.1 is a common default.
It can hurt if the model is later used as a teacher for distillation, because smoothing tends to erase the information about class similarities that the teacher's logits would otherwise carry.
It pairs well with mixup, which also produces soft labels.

Key idea

Label smoothing softens one-hot targets by reserving a little probability for the other classes. This keeps logits bounded, improves calibration, and aids generalization at the cost of slightly fuzzier targets.