A label per pixel
Semantic segmentation assigns a class to every pixel rather than one label per image. This needs an output with the same height and width as the input, which a plain classifier cannot give because it collapses the whole image into a single vector of class scores.
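To make the difference concrete, here is a minimal sketch in plain Python. The score values, the two-class setup, and the function names classify_image and segment_image are all illustrative assumptions, not part of any library. It shows that a classifier returns one label while a segmenter returns a grid of labels with the same height and width as the input.

```python
# Toy illustration: image-level vs. pixel-level labels.
# scores[c][y][x] is a made-up score for class c at pixel (y, x).

def classify_image(scores):
    """One label for the whole image: the class with the highest total score."""
    totals = [sum(sum(row) for row in class_map) for class_map in scores]
    return totals.index(max(totals))

def segment_image(scores):
    """One label per pixel: argmax over classes at every (y, x)."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]

# Two classes (0 = background, 1 = object) on a 2x3 image.
scores = [
    [[0.9, 0.8, 0.2], [0.7, 0.6, 0.3]],  # class 0 scores
    [[0.1, 0.2, 0.8], [0.3, 0.4, 0.7]],  # class 1 scores
]
image_label = classify_image(scores)  # a single integer
mask = segment_image(scores)          # a 2x3 grid of integers
```

Here image_label is a single class index, while mask has one class index for each of the six pixels.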
The encoder-decoder shape
UNet has a symmetric design:
- The encoder downsamples, building rich but coarse features.
- The decoder upsamples back to full resolution.
Drawn as a diagram, the architecture looks like the letter U, which gives the network its name.
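The symmetry can be traced with a little shape arithmetic. The sketch below is an assumption-laden illustration: it assumes 2x pooling per level, channel counts that double on the way down, four levels, a base width of 64 channels, and convolutions padded so that they do not change the spatial size. The function name unet_shapes is hypothetical.

```python
def unet_shapes(size, base_channels=64, depth=4):
    """Return (encoder, bottleneck, decoder) as (side_length, channels) entries."""
    encoder = []
    channels = base_channels
    for _ in range(depth):
        encoder.append((size, channels))
        size //= 2      # assumed 2x2 pooling halves the resolution
        channels *= 2   # assumed channel doubling as resolution drops
    bottleneck = (size, channels)

    decoder = []
    for _ in range(depth):
        size *= 2       # upsampling doubles the resolution
        channels //= 2  # channel count shrinks back symmetrically
        decoder.append((size, channels))
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes(256)
# encoder:    (256, 64), (128, 128), (64, 256), (32, 512)
# bottleneck: (16, 1024)
# decoder:    (32, 512), (64, 256), (128, 128), (256, 64)
```

The decoder list is the encoder list in reverse, which is the two arms of the U meeting at the bottleneck.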
The skip connections
Downsampling loses precise location. UNet fixes this with skip connections that copy the feature map from each encoder level to the matching decoder level, where it is concatenated with the upsampled features. The decoder then combines coarse semantics with the sharp spatial detail saved before pooling, producing crisp boundaries.
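A small sketch shows what the skip connection recovers. It makes several simplifying assumptions: a single-channel 4x4 feature map, max pooling for downsampling, and nearest-neighbour upsampling standing in for a learned up-convolution. The helper names are hypothetical.

```python
def max_pool_2x2(fmap):
    """Downsample by taking the max of each non-overlapping 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

def upsample_2x(fmap):
    """Nearest-neighbour upsampling: repeat every value in a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def concat_channels(a, b):
    """Stack two single-channel maps so each pixel holds a pair (a, b)."""
    return [list(zip(row_a, row_b)) for row_a, row_b in zip(a, b)]

# Encoder feature map with one active pixel at row 1, column 1.
encoder_map = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pooled = max_pool_2x2(encoder_map)   # [[1, 0], [0, 0]]: the exact position is gone
upsampled = upsample_2x(pooled)      # a blurry 2x2 block of ones in the top-left
merged = concat_channels(upsampled, encoder_map)  # the skip connection
```

After pooling and upsampling alone, all four pixels in the top-left block look identical. In merged, only the pixel at row 1, column 1 carries the pair (1, 1), so a later layer can tell which pixel inside the coarse block was actually active.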
Why it works well on few images
UNet was designed for medical images, where labeled data is scarce. The skip connections and heavy data augmentation let it learn precise masks from small datasets, which is one reason it remains a common default for dense prediction.
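The key constraint when augmenting segmentation data is that the image and its mask must be transformed together, or the labels no longer line up with the pixels. The sketch below uses random flips as a simple stand-in for heavier augmentations such as elastic deformations. The function name random_flip and the toy data are assumptions for illustration.

```python
import random

def random_flip(image, mask, rng):
    """Apply the same random horizontal and vertical flips to image and mask."""
    if rng.random() < 0.5:
        # Horizontal flip: reverse every row in both arrays.
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    if rng.random() < 0.5:
        # Vertical flip: reverse the order of the rows in both arrays.
        image = image[::-1]
        mask = mask[::-1]
    return image, mask

# Toy 2x3 image whose mask marks the bright pixels (value above 5).
image = [[1, 9, 2], [8, 3, 7]]
mask = [[0, 1, 0], [1, 0, 1]]

rng = random.Random(0)  # seeded only so the example is repeatable
augmented = [random_flip(image, mask, rng) for _ in range(8)]
```

Each pair in augmented is a new training example in which the mask still marks exactly the bright pixels of its image.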
Key idea
UNet encodes an image into coarse semantic features, then decodes back to full resolution, using skip connections to restore spatial detail and produce crisp per-pixel masks even from small datasets.