Machine Learning

Instance Segmentation with Mask R-CNN

Adding a mask head and ROI align for per-object pixel masks.

Beyond a box

Instance segmentation asks for a separate pixel mask for each object, so two overlapping cats come out as two instances. This is harder than semantic segmentation, which only labels each pixel with a class and would merge both cats into one cat region.
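
The difference can be made concrete with a toy example. The sketch below uses a made-up 4x6 grid and two hand-drawn rectangles standing in for cats; the array names and sizes are illustrative assumptions, not part of any real dataset or library.

```python
import numpy as np

# Toy 4x6 image with two overlapping "cat" objects (hypothetical data).
cat_a = np.zeros((4, 6), dtype=bool)
cat_b = np.zeros((4, 6), dtype=bool)
cat_a[1:3, 0:3] = True   # first cat covers columns 0-2
cat_b[1:3, 2:5] = True   # second cat covers columns 2-4, overlapping at column 2

# Semantic segmentation: one class label per pixel, so both cats merge.
CAT = 1
semantic = np.where(cat_a | cat_b, CAT, 0)

# Instance segmentation: one binary mask per object, stacked along axis 0.
instances = np.stack([cat_a, cat_b])

print(semantic)                # a single connected region of 1s
print(instances.shape)         # (2, 4, 6): two separate masks
print(int(semantic.sum()))     # 10 cat pixels in the merged region
print(int(instances.sum()))    # 12, because the overlap is counted once per cat
```

The semantic map cannot say where one cat ends and the other begins, while the stacked instance masks keep that information.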

Building on Faster R-CNN

Mask R-CNN extends Faster R-CNN by adding a third branch. Alongside the classifier and box regressor, a small fully convolutional mask head predicts binary masks for each proposed region, one mask per class.

  • The box head says where and what.
  • The mask head says which pixels belong.
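
To picture what the three branches hand back, here is a minimal sketch of their output shapes for a batch of proposed regions. Random numbers stand in for real network outputs, and the specific choices (3 regions, 5 classes, a 28x28 mask grid, an extra background slot, class-specific box offsets) are illustrative assumptions rather than fixed properties of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rois, num_classes, mask_size = 3, 5, 28  # illustrative values only

# Placeholder outputs: random numbers in place of real predictions.
outputs = {
    # Box head, "what": one score per class plus an assumed background slot.
    "class_logits": rng.normal(size=(num_rois, num_classes + 1)),
    # Box head, "where": four box offsets, here assumed to be per class.
    "box_deltas": rng.normal(size=(num_rois, num_classes, 4)),
    # Mask head, "which pixels": one small mask grid per class per region.
    "mask_logits": rng.normal(size=(num_rois, num_classes, mask_size, mask_size)),
}

for name, arr in outputs.items():
    print(name, arr.shape)
```

Only the mask output keeps a spatial layout, which is why it is the branch most affected by how features are extracted for each region.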

Why ROI align matters

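ROI pooling rounds region coordinates to the feature-map grid, which shifts the extracted features slightly relative to the object. Because each feature cell covers many image pixels, even a sub-cell shift matters. Mask R-CNN replaces it with ROI align, which keeps the fractional coordinates and reads feature values by bilinear interpolation. This small fix sharply improved mask quality because pixel masks are sensitive to exact alignment.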
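
The effect of rounding can be sketched in a few lines of NumPy. This is a simplified illustration, not the full operators: it samples one point per output bin (real implementations typically sample several), and the "rounded" variant only isolates the coordinate rounding instead of reproducing ROI pooling's max over cells. The function names and the test feature map are hypothetical.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Interpolate a 2D feature map at a fractional (y, x) location."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

def sample_roi(feature, box, out_size=2):
    """Sample the centre of each output bin; box = (y1, x1, y2, x2) in feature cells."""
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(
                feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w
            )
    return out

# Feature map whose value equals its column index, so each output
# value reveals the x position that was actually sampled.
feature = np.tile(np.arange(6, dtype=float), (4, 1))
box = (0.0, 1.4, 2.0, 3.4)  # fractional box in feature-map coordinates

aligned = sample_roi(feature, box)                            # keeps fractions
rounded = sample_roi(feature, tuple(np.round(box).tolist()))  # rounds the box first

print(aligned)
print(rounded)
print(aligned - rounded)
```

With the fractional box the samples land near x = 1.9 and 2.9, while the rounded box samples x = 1.5 and 2.5, an offset of about 0.4 feature cells. At a feature stride of 16 (an assumed value), that would correspond to roughly 6 image pixels.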

Decoupling class and mask

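The mask head predicts one binary mask per class, each with its own per-pixel sigmoid, and the classifier's chosen class picks which one to use. Because the masks are not pushed through a softmax across classes, classes do not compete for pixels during mask learning, which improved results.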
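
A minimal sketch of this decoupling, assuming mask logits of shape (regions, classes, height, width) and hypothetical helper names. The loss is computed only on the channel of the given class, so the other channels cannot influence it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_masks(mask_logits, class_ids, threshold=0.5):
    """Pick, for each region, the mask channel of its chosen class.

    mask_logits: (num_rois, num_classes, H, W); class_ids: (num_rois,).
    """
    roi_idx = np.arange(mask_logits.shape[0])
    chosen = mask_logits[roi_idx, class_ids]   # (num_rois, H, W)
    return sigmoid(chosen) >= threshold        # independent per-pixel decision

def mask_loss(mask_logits, class_ids, targets):
    """Per-pixel binary cross-entropy on the selected class channel only."""
    roi_idx = np.arange(mask_logits.shape[0])
    p = np.clip(sigmoid(mask_logits[roi_idx, class_ids]), 1e-7, 1 - 1e-7)
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

# Two regions, three classes, 2x2 masks (toy values).
logits = np.zeros((2, 3, 2, 2))
logits[0, 1] = 5.0    # region 0: class 1 channel says "all foreground"
logits[1, 2] = -5.0   # region 1: class 2 channel says "all background"
class_ids = np.array([1, 2])
targets = np.stack([np.ones((2, 2)), np.zeros((2, 2))])

masks = select_masks(logits, class_ids)
loss = mask_loss(logits, class_ids, targets)

# Changing an unselected channel leaves the loss untouched.
perturbed = logits.copy()
perturbed[0, 0] = 100.0
print(masks.shape, loss == mask_loss(perturbed, class_ids, targets))
```

During training the class index would come from the ground-truth label, and at inference from the classifier's prediction; the selection step is the same in both cases.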

The payoff

With these pieces, Mask R-CNN produces boxes, classes, and masks together from one shared network, and it became a standard baseline for instance-level vision tasks.

Key idea

Mask R-CNN adds a per-class mask head to Faster R-CNN and swaps ROI pooling for ROI align, removing rounding so each object gets a precise pixel mask alongside its box and class.

Check yourself

1. What does ROI align fix compared to ROI pooling?

2. How does instance segmentation differ from semantic segmentation?

3. Why predict one mask per class rather than one shared mask?