The core idea
Dropout randomly zeroes a fraction of activations during training. Each step trains a different thinned subnetwork, so neurons cannot rely on specific partners. The result is an implicit ensemble that resists overfitting.
Scaling correctly
To keep the expected output unchanged, surviving activations are scaled up during training by one over the keep probability (one minus the drop rate). With this inverted dropout, inference needs no rescaling: simply turn dropout off.
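The scaling rule can be illustrated with a minimal sketch. It uses plain Python lists to stay dependency-free, and the function name `inverted_dropout` and its parameters are hypothetical, chosen only for this example; a real framework would apply the same idea to tensors.

```python
import random


def inverted_dropout(activations, drop_rate, training=True, rng=None):
    """Apply inverted dropout to a flat list of activations.

    drop_rate is the probability of zeroing each unit (illustrative name).
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # Inference (or a zero rate): return the activations unchanged.
    if not training or drop_rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_rate
    # Keep each unit with probability keep_prob and scale survivors by
    # 1 / keep_prob, so each output's expected value equals its input.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


x = [1.0] * 10
# Training: every entry is either 0.0 (dropped) or 2.0 (kept and rescaled).
train_out = inverted_dropout(x, drop_rate=0.5, rng=random.Random(0))
# Inference: dropout is off and the input passes through untouched.
eval_out = inverted_dropout(x, drop_rate=0.5, training=False)
```

Because the rescaling happens at training time, the inference path is a plain pass-through, which is the property the paragraph above describes.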
Variants for structure
- Spatial dropout drops entire feature maps in convolutional nets: neighboring pixels are strongly correlated, so dropping single units removes little information.
- DropConnect zeroes individual weights rather than activations.
- DropPath drops whole residual branches, common in deep architectures.
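Two of these variants can be sketched in the same dependency-free style. The names `spatial_dropout` and `drop_path` are hypothetical, the feature maps are assumed to be nested lists indexed as channel, row, column, and rescaling the kept branch in `drop_path` by one over the keep probability is an assumption of this sketch (some formulations rescale at inference instead).

```python
import random


def spatial_dropout(feature_maps, drop_rate, rng):
    """Drop whole channels: one keep/drop decision per feature map."""
    keep_prob = 1.0 - drop_rate
    out = []
    for channel in feature_maps:
        # A single random draw decides the fate of the entire channel.
        scale = 1.0 / keep_prob if rng.random() < keep_prob else 0.0
        out.append([[v * scale for v in row] for row in channel])
    return out


def drop_path(shortcut, branch, drop_rate, rng):
    """Residual output for one sample, with the whole branch kept or dropped."""
    keep_prob = 1.0 - drop_rate
    # One draw for the entire branch; the shortcut always passes through.
    scale = 1.0 / keep_prob if rng.random() < keep_prob else 0.0
    return [s + b * scale for s, b in zip(shortcut, branch)]


rng = random.Random(1)
maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]  # 8 channels of 2x2
dropped_maps = spatial_dropout(maps, drop_rate=0.5, rng=rng)
block_out = drop_path([1.0, 2.0], [10.0, 20.0], drop_rate=0.5, rng=rng)
```

In both functions the random draw is shared across a whole structural unit (a channel or a branch), which is what distinguishes these variants from per-unit dropout.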
How it fits training
Practical tuning
- Drop rates of 0.1 to 0.5 are typical; the higher end of that range suits fully connected layers near the output.
- Modern conv nets often lean on batch norm and use little dropout.
- Too much dropout starves the network and underfits.
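As a rough illustration of why very high rates hurt, the sketch below computes, for a single unit with activation 1.0 under inverted dropout, the keep probability, the rescaling factor, and the variance of the resulting output. The helper name `dropout_stats` is hypothetical, and the single-unit setting is a simplifying assumption.

```python
def dropout_stats(drop_rate):
    """Per-unit statistics of inverted dropout applied to an activation of 1.0."""
    keep_prob = 1.0 - drop_rate
    return {
        "keep_prob": keep_prob,
        # Survivors are multiplied by this factor during training.
        "scale": 1.0 / keep_prob,
        # Output is 1/keep_prob with probability keep_prob, else 0:
        # the mean is 1 and the variance is drop_rate / keep_prob.
        "noise_variance": drop_rate / keep_prob,
    }


for rate in (0.1, 0.5, 0.9):
    print(rate, dropout_stats(rate))
```

The variance grows without bound as the drop rate approaches 1, so at extreme rates the signal reaching later layers is dominated by noise.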
Key idea
Dropout trains a random subnetwork each step to build an implicit ensemble. Use spatial dropout for conv features and remember inference runs the full, undropped network.