A reference for boxes
Predicting object boxes from scratch is hard because position and size vary wildly. Anchor boxes give the detector a set of reference rectangles at every location, each with a fixed scale and aspect ratio.
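As a minimal sketch of what "a set of reference rectangles at every location" can look like, the snippet below tiles anchors over a small feature grid. The function name `make_anchors`, the `stride` parameter, and the convention that a ratio means width divided by height are assumptions made for illustration, not something fixed by the text above.

```python
import math

def make_anchors(grid_h, grid_w, stride, scales, ratios):
    """Return anchors as (cx, cy, w, h), one set per grid cell.

    Assumed conventions: ratio = width / height, and each scale is the
    square root of the anchor area, so area stays constant across ratios.
    """
    anchors = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Center of this grid cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors

# A 2x2 grid with one scale and three ratios gives 2 * 2 * 3 = 12 anchors.
anchors = make_anchors(grid_h=2, grid_w=2, stride=16, scales=[32], ratios=[0.5, 1.0, 2.0])
```

Every cell gets the same shapes; only the centers change from cell to cell.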
Predict the adjustment
Instead of raw coordinates, the network predicts a small offset from each anchor: how far to shift the center and how much to rescale the width and height. Learning a correction to a nearby reference is far easier than learning the absolute box.
- Several anchors per location cover different shapes.
- Each anchor also gets an objectness score.
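The offset idea can be sketched with one common parameterization, assumed here for illustration: center shifts are expressed as fractions of the anchor size, and width and height changes as log ratios. The text above does not specify an encoding, so the helpers `encode` and `decode` are hypothetical.

```python
import math

def encode(anchor, box):
    """Offsets (tx, ty, tw, th) that turn `anchor` into `box`.

    Both boxes are (cx, cy, w, h). Assumed encoding: center shifts are
    scaled by the anchor size, and sizes use log ratios.
    """
    ax, ay, aw, ah = anchor
    bx, by, bw, bh = box
    return ((bx - ax) / aw, (by - ay) / ah, math.log(bw / aw), math.log(bh / ah))

def decode(anchor, offsets):
    """Apply predicted offsets to an anchor to recover a box (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
box = (54.0, 46.0, 30.0, 40.0)
offsets = encode(anchor, box)        # training target for this anchor
recovered = decode(anchor, offsets)  # what inference does with a prediction
```

With this encoding, an all-zero offset returns the anchor itself, so a well-placed anchor needs only a small correction.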
Matching during training
Each ground truth box is assigned to the anchors that overlap it strongly, with overlap measured by intersection over union (IoU). Matched anchors learn the offset to the true box, and anchors with low overlap learn background. Anchors that fall between the two thresholds are ignored to avoid noisy targets.
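A minimal sketch of the matching rule follows. The thresholds 0.7 and 0.3 are assumed values chosen for illustration, and the function names and label codes (1 for matched, 0 for background, -1 for ignored) are hypothetical. Boxes here are corner-format tuples `(x1, y1, x2, y2)`.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_labels(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor 1 (matched), 0 (background), or -1 (ignored).

    Also returns, per anchor, the index of its matched ground truth box
    (None for background and ignored anchors).
    """
    labels, matches = [], []
    for anchor in anchors:
        overlaps = [iou(anchor, gt) for gt in gt_boxes]
        best_iou = max(overlaps) if overlaps else 0.0
        if best_iou >= pos_thresh:
            labels.append(1)
            matches.append(overlaps.index(best_iou))
        elif best_iou < neg_thresh:
            labels.append(0)
            matches.append(None)
        else:
            labels.append(-1)  # in between: excluded from the loss
            matches.append(None)
    return labels, matches

gt_boxes = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
labels, matches = assign_labels(anchors, gt_boxes)
# labels == [1, -1, 0], matches == [0, None, None]
```

Some detectors additionally force each ground truth box to match its single best anchor so that no object goes unassigned; this sketch leaves that step out.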
Choosing anchors
Scales and aspect ratios should reflect the dataset. Pedestrian detection benefits from tall anchors, while car detection benefits from wide ones. Some methods cluster the training boxes, for example with k-means, to pick anchor shapes that fit the data.
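The clustering approach can be sketched as a small k-means over box widths and heights, where closeness is measured by IoU between boxes that share a center. The function names, the deterministic "first k boxes" initialization, and the toy data are all assumptions for illustration; real pipelines typically use random or smarter initialization.

```python
def wh_iou(a, b):
    """IoU of two boxes with the same center, each given as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def cluster_anchors(boxes, k, iters=20):
    """Pick k anchor shapes (w, h) by k-means with IoU as the similarity."""
    centers = list(boxes[:k])  # assumed simple init: the first k boxes
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the center it overlaps most.
            nearest = max(range(k), key=lambda i: wh_iou(box, centers[i]))
            groups[nearest].append(box)
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                mean_w = sum(w for w, _ in group) / len(group)
                mean_h = sum(h for _, h in group) / len(group)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Toy (w, h) data: three tall boxes and three wide boxes.
boxes = [(20, 60), (60, 20), (22, 58), (58, 22), (18, 62), (62, 18)]
shapes = cluster_anchors(boxes, k=2)
```

On this toy data the two clusters separate into one tall shape and one wide shape, which mirrors the people versus cars example above.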
Key idea
Anchor boxes are preset reference rectangles: the detector predicts a small offset and an objectness score for each one, ground truth is assigned to anchors by IoU, and anchor shapes are chosen to match the data.