Machine Learning

The Dropout Variants

Standard dropout and its spatial and structured cousins for regularizing networks.

4 min read · core

The core idea

Dropout randomly zeroes a fraction of activations during training. Each step trains a different thinned subnetwork, so neurons cannot rely on specific partners. The result is an implicit ensemble that resists overfitting.
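
To make the "different thinned subnetwork each step" idea concrete, here is a minimal sketch in plain Python. The layer size, drop rate, seed, and the helper name `sample_mask` are all hypothetical choices for illustration; a real framework samples these masks internally.

```python
import random

def sample_mask(num_units, drop_rate, rng):
    """Return a 0/1 mask; each unit is dropped independently with probability drop_rate."""
    return tuple(0 if rng.random() < drop_rate else 1 for _ in range(num_units))

rng = random.Random(0)          # fixed seed, only so the sketch is repeatable
num_units, drop_rate = 8, 0.5   # assumed layer size and rate, for illustration

# One fresh mask per training step: each mask selects a different subnetwork.
masks = [sample_mask(num_units, drop_rate, rng) for _ in range(5)]
for step, mask in enumerate(masks):
    print(f"step {step}: mask={mask} active_units={sum(mask)}")

# With n units there are 2**n possible masks, all sharing one set of weights.
print("possible subnetworks:", 2 ** num_units)
```

Because every mask reuses the same weights, the subnetworks are trained jointly rather than as separate models, which is why the ensemble is described as implicit.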

Scaling correctly

To keep the expected output unchanged, surviving activations are scaled up during training by one over the keep probability (for a keep probability of 0.8, by 1/0.8 = 1.25). With this inverted dropout, inference needs no rescaling: just turn dropout off.
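
The scaling rule can be sketched in a few lines of plain Python. This is an illustrative version working on a flat list, not any library's implementation; the function name `inverted_dropout` and its parameters are assumptions made for this example.

```python
import random

def inverted_dropout(activations, keep_prob, training=True, rng=None):
    """Inverted dropout on a flat list of activations (illustrative sketch)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        # Inference: dropout is simply off, no rescaling needed.
        return list(activations)
    rng = rng or random.Random()
    scale = 1.0 / keep_prob  # survivors are scaled up so the expectation is unchanged
    return [a * scale if rng.random() < keep_prob else 0.0 for a in activations]

# Each unit contributes keep_prob * (a / keep_prob) + (1 - keep_prob) * 0 = a in expectation.
rng = random.Random(0)
x = [1.0] * 20000
y = inverted_dropout(x, keep_prob=0.8, rng=rng)
print("mean before:", sum(x) / len(x))
print("mean after: ", sum(y) / len(y))  # close to the mean before, up to sampling noise
```

Doing the scaling at training time is a design choice: the alternative is to leave training activations unscaled and multiply by the keep probability at inference, which gives the same expectation but makes the inference path depend on the dropout rate.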

Variants for structure

  • Spatial dropout drops entire feature maps (channels) in convolutional nets. Neighboring activations within a map are strongly correlated, so dropping single units removes little information.
  • DropConnect zeroes individual weights rather than activations.
  • DropPath drops whole residual branches, common in deep architectures.
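
The three variants differ only in what gets masked. The sketch below uses nested Python lists to show the granularity of each; the function names and signatures are hypothetical. Two points are assumptions rather than part of the definitions above: the 1/keep_prob rescaling in `drop_connect` is borrowed from inverted dropout for simplicity, and `drop_path` rescales the surviving branch at training time, whereas some formulations rescale at inference instead.

```python
import random

def spatial_dropout(feature_maps, keep_prob, rng):
    """feature_maps: list of channels, each a 2D list. Drops whole channels at once."""
    out = []
    for channel in feature_maps:
        keep = rng.random() < keep_prob          # one decision per channel
        scale = 1.0 / keep_prob if keep else 0.0
        out.append([[v * scale for v in row] for row in channel])
    return out

def drop_connect(weights, keep_prob, rng):
    """weights: 2D list. Zeroes individual weights instead of activations.
    Rescaling survivors by 1/keep_prob is a simplifying assumption of this sketch."""
    return [[w / keep_prob if rng.random() < keep_prob else 0.0 for w in row]
            for row in weights]

def drop_path(x, branch, keep_prob, rng, training=True):
    """Residual block y = x + branch(x); the whole branch is skipped
    with probability 1 - keep_prob during training."""
    if not training:
        return [xi + bi for xi, bi in zip(x, branch(x))]
    if rng.random() >= keep_prob:
        return list(x)                           # branch dropped: identity shortcut only
    return [xi + bi / keep_prob for xi, bi in zip(x, branch(x))]

rng = random.Random(0)
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]   # 4 toy channels of size 2x2
print(spatial_dropout(maps, keep_prob=0.5, rng=rng))   # channels are all-zero or fully kept
print(drop_connect([[1.0, 1.0], [1.0, 1.0]], keep_prob=0.5, rng=rng))
print(drop_path([1.0, 2.0], lambda v: [10.0 * t for t in v], keep_prob=0.5, rng=rng))
```

Note that `spatial_dropout` makes a single keep-or-drop decision per channel, so a dropped feature map disappears entirely instead of leaving correlated neighbors behind.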

Practical tuning

  • Drop rates of 0.1 to 0.5 are typical; the higher end suits fully connected layers near the output.
  • Modern conv nets often lean on batch norm and use little dropout.
  • Too much dropout starves the network and underfits.

Key idea

Dropout trains a random subnetwork each step to build an implicit ensemble. Use spatial dropout for conv features and remember inference runs the full, undropped network.

Check yourself

1. Why does inverted dropout scale surviving activations during training?

2. Why is spatial dropout used instead of standard dropout in convolutional layers?