Beyond a box
Instance segmentation asks for a separate pixel mask for each object, so two overlapping cats come out as two instances. This is harder than semantic segmentation, which only labels each pixel with a class and would therefore merge them into one cat region.
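To make the distinction concrete, here is a minimal sketch in plain Python. The one row, six pixel "image", the two cat masks, and the helper names are all made up for illustration.

```python
# Toy data (made up for illustration): one row of 6 pixels with two touching cats.
cat_a = [1, 1, 1, 0, 0, 0]
cat_b = [0, 0, 0, 1, 1, 0]
instance_masks = [cat_a, cat_b]


def to_semantic(masks):
    """Collapse per-instance masks into one per-pixel 'cat' map (logical OR)."""
    return [int(any(pixels)) for pixels in zip(*masks)]


def count_regions(row):
    """Count runs of consecutive 1s, i.e. connected regions in a 1D row."""
    runs, previous = 0, 0
    for value in row:
        if value == 1 and previous == 0:
            runs += 1
        previous = value
    return runs


semantic = to_semantic(instance_masks)
print(semantic)                  # [1, 1, 1, 1, 1, 0]
print(count_regions(semantic))   # 1: the two cats merge into one region
print(len(instance_masks))       # 2: instance segmentation keeps them apart
```

The semantic map still marks every cat pixel correctly, but it can no longer say where one cat ends and the other begins.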
Building on Faster RCNN
Mask RCNN extends Faster RCNN by adding a third branch. Alongside the classifier and box refiner, a small fully convolutional mask head runs in parallel on each proposed region and predicts a binary mask for it, one mask per class.
- The box head says where and what.
- The mask head says which pixels belong.
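The shape of what the three branches emit for one proposed region can be sketched as follows. This is only an illustration: the container name `RegionOutput`, the class count of 3, and the 28 by 28 mask resolution are assumptions chosen for the example, and all values are placeholders.

```python
from dataclasses import dataclass
from typing import List

NUM_CLASSES = 3   # assumed number of classes (K)
MASK_SIZE = 28    # assumed mask resolution (M)


@dataclass
class RegionOutput:
    """Hypothetical container for the three branch outputs of one proposal."""
    class_scores: List[float]        # classifier branch: one score per class
    box: List[float]                 # box branch: refined [x1, y1, x2, y2]
    masks: List[List[List[float]]]   # mask branch: K masks, each M x M


def dummy_region_output() -> RegionOutput:
    """Build a placeholder output with the right shapes (values are made up)."""
    return RegionOutput(
        class_scores=[1.0 / NUM_CLASSES] * NUM_CLASSES,
        box=[10.0, 20.0, 110.0, 220.0],
        masks=[
            [[0.5] * MASK_SIZE for _ in range(MASK_SIZE)]
            for _ in range(NUM_CLASSES)
        ],
    )


out = dummy_region_output()
# Shapes: K class scores, 4 box coordinates, K masks of M x M values each.
print(len(out.class_scores), len(out.box), len(out.masks),
      len(out.masks[0]), len(out.masks[0][0]))
```

Note that the mask branch output grows with the number of classes, since it holds one full mask per class for every region.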
Why ROI align matters
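ROI pooling rounded region coordinates to the feature map grid, once when mapping the box onto the feature map and again when dividing it into bins. Because each feature cell covers many image pixels, even a fractional rounding error misaligns the extracted features with the object. Mask RCNN replaces it with ROI align, which keeps the fractional coordinates and computes each value by bilinear interpolation from the neighbouring cells. This small fix sharply improved mask quality because pixel level masks are far more sensitive to exact alignment than class labels are.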
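The difference between rounding and bilinear sampling can be sketched in plain Python. This is a simplified illustration, not the reference implementation: it takes a single sample at the centre of each output bin (the original method samples several points per bin and aggregates them), and the stride of 16, the ramp shaped feature map, and the function names are assumptions made for the example.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate a 2D feature map (list of rows) at a continuous point (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbouring cells stay inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def rounded_sample(feature, x, y):
    """Snap (x, y) to the nearest cell, mimicking the rounding in ROI pooling."""
    h, w = len(feature), len(feature[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return feature[yi][xi]


def roi_align(feature, box, stride, out_size):
    """Simplified ROI align: one bilinear sample at the centre of each bin."""
    x1, y1, x2, y2 = [c / stride for c in box]  # fractional coordinates are kept
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    return [
        [bilinear_sample(feature, x1 + (j + 0.5) * bin_w, y1 + (i + 0.5) * bin_h)
         for j in range(out_size)]
        for i in range(out_size)
    ]


STRIDE = 16  # assumed ratio between image pixels and feature cells
# Ramp feature map: each cell stores its own column index, so errors are easy to read.
feature = [[float(col) for col in range(8)] for _ in range(8)]
box = [20.8, 16.0, 52.8, 48.0]  # image coordinates [x1, y1, x2, y2], made up

aligned = roi_align(feature, box, STRIDE, out_size=2)
snapped = rounded_sample(feature, 1.8, 1.5)
print(aligned[0])  # values close to 1.8 and 2.8, the true bin centres
print(snapped)     # 2.0: the first bin centre snapped to a whole cell
```

On this ramp the snapped value is off by about 0.2 feature cells, which at the assumed stride of 16 corresponds to roughly 3 image pixels, while the interpolated values land on the true bin centres.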
Decoupling class and mask
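The mask head predicts one mask per class, each through a per pixel sigmoid, and the classifier picks which one to use. During training the mask loss is computed only on the mask of the ground truth class, so the masks for the other classes are ignored. Separating the two avoids the competition between classes that a per pixel softmax over classes would create, which improved results.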
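A minimal sketch of this decoupling in plain Python follows. It assumes each class mask is given as a flat list of per pixel logits and that the final mask is binarised at a probability of 0.5; the function names and the toy numbers are hypothetical.

```python
import math


def mask_loss(mask_logits, gt_class, gt_mask):
    """Mean per-pixel binary cross-entropy on the ground-truth class's mask only.

    mask_logits holds K flattened masks of logits, one per class. The other
    K - 1 masks are never read, so classes do not compete with each other.
    """
    logits = mask_logits[gt_class]
    total = 0.0
    for z, t in zip(logits, gt_mask):
        # Numerically stable form of -(t*log(p) + (1-t)*log(1-p)), p = sigmoid(z).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(gt_mask)


def select_mask(mask_logits, class_scores):
    """At inference, keep the mask of the class the classifier scored highest."""
    k = max(range(len(class_scores)), key=class_scores.__getitem__)
    # sigmoid(z) >= 0.5 is equivalent to z >= 0, so threshold the logits directly.
    return k, [int(z >= 0.0) for z in mask_logits[k]]


# Toy example: 2 classes, masks of 2 pixels each (all numbers made up).
logits_a = [[5.0, -5.0], [-5.0, 5.0]]
logits_b = [[5.0, -5.0], [9.0, -9.0]]   # only the class 1 mask differs
gt_mask = [1, 0]

loss_a = mask_loss(logits_a, gt_class=0, gt_mask=gt_mask)
loss_b = mask_loss(logits_b, gt_class=0, gt_mask=gt_mask)
print(loss_a == loss_b)                       # True: the class 1 mask is ignored
print(select_mask(logits_a, [0.1, 0.9]))      # (1, [0, 1])
```

Because the loss never looks at the masks of other classes, the mask head only has to answer "which pixels belong to this object", and the question "which class is it" is left entirely to the classifier.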
The payoff
With these pieces, Mask RCNN produces boxes, classes, and masks together in one shared network, and it became a standard baseline for instance level vision.
Key idea
Mask RCNN adds a per class mask head to Faster RCNN and swaps ROI pooling for ROI align, removing rounding so each object gets a precise pixel mask alongside its box and class.