Detection in one shot
YOLO (You Only Look Once) treats detection as a single regression problem over the whole image. One forward pass outputs all boxes and class scores, which makes it fast enough for real-time use.
The grid view
The image is divided into a grid of cells. Each cell is responsible for objects whose center falls inside it and predicts:
- A set of boxes with positions and sizes.
- An objectness score per box.
- Class probabilities for the cell.
This single-stage design contrasts with two-stage methods, which first propose candidate regions and then classify each one.
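As a rough illustration of the grid assignment, the sketch below maps an object's center to the cell responsible for it and computes the size of the per-image output. The function names are hypothetical, and the 7x7 grid, 2 boxes per cell, and 20 classes in the usage note are illustrative assumptions rather than values stated in this text.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell that contains the box center.

    cx, cy are the center coordinates in pixels (assumed convention).
    """
    # Scale the center into grid units, then clamp so a center lying
    # exactly on the right or bottom edge still maps to a valid cell.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


def output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the prediction tensor for one image.

    Each box contributes 5 numbers (x, y, w, h, objectness); the class
    probabilities are stored once per cell, as in the list above.
    """
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)
```

For a 448x448 image on a 7x7 grid, a center at (224, 100) lands in row 1, column 3, and `output_shape(7, 2, 20)` gives `(7, 7, 30)`.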
Why it is fast
Because the network produces everything at once, there is no separate proposal pass and no repeated cropping. The whole image is processed a single time, so inference cost does not grow with the number of candidate regions.
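To make the single-pass idea concrete, here is a minimal sketch that turns one output grid into detections without any cropping or second network call. The layout is an assumption: each cell holds its boxes as (x, y, w, h, objectness), with x and y as offsets inside the cell and w and h relative to the image, followed by the cell's class probabilities. The `decode` name and the 0.5 threshold are hypothetical.

```python
def decode(output, boxes_per_cell, conf_threshold=0.5):
    """Convert one S x S x (B*5 + C) output (nested lists) into detections.

    Returns a list of (cx, cy, w, h, class_index, score) tuples with the
    center normalized to the image. The tensor layout is assumed, see above.
    """
    grid_size = len(output)
    detections = []
    for row in range(grid_size):
        for col in range(grid_size):
            cell = output[row][col]
            # Class probabilities are shared by every box in the cell.
            class_probs = cell[boxes_per_cell * 5:]
            best_class = max(range(len(class_probs)), key=class_probs.__getitem__)
            for b in range(boxes_per_cell):
                x, y, w, h, objectness = cell[b * 5:(b + 1) * 5]
                score = objectness * class_probs[best_class]
                if score >= conf_threshold:
                    # Turn the in-cell offset into an image-relative center.
                    cx = (col + x) / grid_size
                    cy = (row + y) / grid_size
                    detections.append((cx, cy, w, h, best_class, score))
    return detections
```

Every box comes from the same output, so the only work after the forward pass is this cheap loop. Detectors of this kind usually follow it with non-maximum suppression to drop duplicate boxes, which is omitted here.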
The trade
Early versions struggled with small or clustered objects because each cell predicts only a limited number of boxes. Later versions added anchors, multi-scale features, and finer grids to close much of the accuracy gap with two-stage detectors.
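The clustering limit can be sketched by counting how many object centers fall into each cell. This is a simplified model with a hypothetical function name: it treats a cell's capacity as its number of boxes, which is optimistic, since class probabilities shared across a cell can reduce the number of distinct objects it can represent.

```python
from collections import Counter


def dropped_objects(centers, grid_size, boxes_per_cell):
    """Count objects that cannot get their own box because their cell is full.

    centers is a list of (cx, cy) pairs normalized to the range [0, 1].
    Capacity per cell is assumed to equal boxes_per_cell (a simplification).
    """
    per_cell = Counter(
        (
            min(int(cy * grid_size), grid_size - 1),
            min(int(cx * grid_size), grid_size - 1),
        )
        for cx, cy in centers
    )
    # Anything beyond the cell's capacity has no box left to claim.
    return sum(max(0, count - boxes_per_cell) for count in per_cell.values())
```

With three nearby centers `[(0.45, 0.45), (0.52, 0.48), (0.55, 0.53)]` and 2 boxes per cell, a 7x7 grid puts all three in one cell and drops one object, while a 14x14 grid spreads them over three cells and drops none. That is the intuition behind finer grids helping with clustered objects.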
Key idea
YOLO predicts all boxes, objectness scores, and classes in one pass over a grid. It trades some accuracy on small or clustered objects for real-time speed, and later versions recovered much of that accuracy.