The Dropout Variants

Standard dropout and its spatial and structured cousins for regularizing networks.

The core idea

Dropout randomly zeroes a fraction of activations during training. Each step trains a different thinned subnetwork, so neurons cannot rely on specific partners. The result is an implicit ensemble that resists overfitting.

Scaling correctly

To keep the expected output the same, surviving activations are scaled up during training by one over the keep probability, that is, by 1 / (1 - p) for drop rate p. With this inverted dropout, inference needs no change: simply turn dropout off.
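
The scaling rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than any framework's implementation; the function name inverted_dropout and its parameters (drop_rate, training, rng) are assumptions made for this example.

```python
import random

def inverted_dropout(activations, drop_rate, training=True, rng=random):
    """Apply inverted dropout to a flat list of activations (illustrative helper)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if not training or drop_rate == 0.0:
        # Inference: the full network runs, with no masking and no rescaling.
        return list(activations)
    keep_prob = 1.0 - drop_rate
    # Each unit survives independently with probability keep_prob;
    # survivors are scaled by 1 / keep_prob so the expected value is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# Usage: with a drop rate of 0.5, survivors become 2.0 and the mean stays near 1.0.
rng = random.Random(0)
x = [1.0] * 100_000
y = inverted_dropout(x, drop_rate=0.5, rng=rng)
mean_y = sum(y) / len(y)
same = inverted_dropout(x[:5], drop_rate=0.5, training=False)
```

Because the rescaling happens at training time, the inference path is just the identity, which is why the training flag returns the input untouched.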

Variants for structure

Spatial dropout drops entire feature maps in convolutional nets: neighboring pixels are strongly correlated, so zeroing individual units removes little information.
DropConnect zeroes individual weights rather than activations.
DropPath drops whole residual branches, common in deep architectures.
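
The three variants differ mainly in what a single random draw controls: a channel, a weight, or a whole branch. The sketch below illustrates this in plain Python under stated assumptions. The function names and data layouts (feature maps as a list of 2D lists, weights as rows of a matrix) are hypothetical, and rescaling by 1 / keep_prob in all three is an assumption carried over from the inverted dropout convention, not something each method prescribes.

```python
import random

def spatial_dropout(feature_maps, drop_rate, rng=random):
    """Drop whole channels; feature_maps is a list of 2D lists, one per channel."""
    keep_prob = 1.0 - drop_rate
    out = []
    for fmap in feature_maps:
        # One random draw per channel, shared by every pixel in that channel.
        scale = 1.0 / keep_prob if rng.random() < keep_prob else 0.0
        out.append([[v * scale for v in row] for row in fmap])
    return out

def drop_connect(weights, inputs, drop_rate, rng=random):
    """Linear layer with individual weights zeroed; weights has one row per output."""
    keep_prob = 1.0 - drop_rate
    outputs = []
    for row in weights:
        total = 0.0
        for w, x in zip(row, inputs):
            # One random draw per weight, not per activation.
            if rng.random() < keep_prob:
                total += (w / keep_prob) * x
        outputs.append(total)
    return outputs

def drop_path(x, branch, drop_rate, training=True, rng=random):
    """Residual block x + branch(x) that skips the whole branch at random."""
    if not training:
        return [xi + bi for xi, bi in zip(x, branch(x))]
    keep_prob = 1.0 - drop_rate
    if rng.random() >= keep_prob:
        return list(x)  # branch dropped: only the identity shortcut remains
    return [xi + bi / keep_prob for xi, bi in zip(x, branch(x))]

# Usage on toy data.
rng = random.Random(1)
maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]
dropped = spatial_dropout(maps, 0.5, rng=rng)
dense = drop_connect([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 0.0, rng=rng)
times_ten = lambda v: [10.0 * t for t in v]
eval_out = drop_path([1.0, 2.0], times_ten, 0.5, training=False)
train_out = drop_path([1.0, 2.0], times_ten, 0.5, rng=rng)
```

In spatial_dropout every pixel of a channel shares the same fate, whereas drop_connect draws once per weight and drop_path draws once per residual block.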

How it fits training

Practical tuning

Drop rates of 0.1 to 0.5 are typical; the higher end of that range suits fully connected layers near the output.
Modern conv nets often rely on batch normalization and use little dropout.
Too much dropout starves the network and underfits.

Key idea

Dropout trains a random subnetwork each step to build an implicit ensemble. Use spatial dropout for conv features, and remember that inference runs the full, undropped network.