Machine Learning

The Activation Function Choice

How nonlinearities differ and which one to reach for in modern deep networks.


Why nonlinearity is essential

Stacking linear layers only produces another linear function. An activation function inserts a nonlinearity between layers so the network can model curves, boundaries, and interactions.
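The collapse of stacked linear layers can be checked numerically. The sketch below uses made-up weights (W1, b1, W2, b2 are illustrative values, not from any real model) and plain Python lists. It shows that two linear layers equal a single layer with W = W2·W1 and b = W2·b1 + b2, and that inserting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu_vec(v):
    return [max(0.0, a) for a in v]

# Illustrative (assumed) parameters: a 2 -> 3 -> 2 network.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -0.5, 1.0]
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]
b2 = [0.0, 1.0]

def two_layers(x):
    # Two linear layers with no activation in between.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map folded into a single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

def two_layers_relu(x):
    # With a ReLU between the layers, the fold above no longer applies.
    return add(matvec(W2, relu_vec(add(matvec(W1, x), b1))), b2)

x = [1.0, -2.0]
print(two_layers(x))       # [-3.0, 5.0]
print(one_layer(x))        # [-3.0, 5.0]
print(two_layers_relu(x))  # [2.0, 2.5]
```

The first two outputs agree for every input, so the extra layer adds no expressive power; only the version with the nonlinearity computes something a single linear layer cannot.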

The common choices

  • Sigmoid squashes its input into the range between zero and one, but it saturates at both ends, causing vanishing gradients in deep stacks.
  • Tanh is zero-centered and slightly better behaved than sigmoid, but it still saturates.
  • ReLU outputs the input if positive, else zero. It is cheap and avoids saturation on the positive side, making it the default hidden activation.
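  • Leaky ReLU and GELU soften the dead zone so negative inputs still pass a small gradient: Leaky ReLU uses a small constant slope for negatives, while GELU is a smooth curve whose gradient fades only for strongly negative inputs.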
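The functions in the list above can be written out directly. This is a minimal sketch using only the standard library. The Leaky ReLU slope of 0.01 is an assumed, commonly used value, GELU is written in its Gaussian-CDF form, and `grad` is a hypothetical helper that estimates the slope numerically so saturation can be seen.

```python
import math

def sigmoid(x):
    # Branch on sign so math.exp never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # slope=0.01 is an assumed default; it is a tunable hyperparameter.
    return x if x > 0 else slope * x

def gelu(x):
    # GELU as x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def grad(f, x, h=1e-5):
    """Central-difference estimate of df/dx (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                ("leaky_relu", leaky_relu), ("gelu", gelu)]:
    # Compare slopes at a large positive, a moderately negative, and a
    # large negative input.
    print(name, [grad(f, x) for x in (10.0, -1.0, -10.0)])
```

At x = 10 the sigmoid and tanh slopes are close to zero (saturation), while the ReLU family keeps a slope near one. At negative inputs ReLU's slope is exactly zero, Leaky ReLU's stays at the chosen constant, and GELU's is nonzero near -1 but fades toward zero by -10.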

Picking one

The dying ReLU problem

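A ReLU neuron whose pre-activation is negative for every input outputs zero everywhere, so it receives zero gradient, its weights stop updating, and it never recovers. Leaky ReLU and GELU keep a nonzero gradient for negative inputs, which lets such a neuron keep learning. Output layers are special: use softmax for multiclass classification and sigmoid for binary classification.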
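To illustrate the dying ReLU problem described above, the sketch below builds a single neuron z = w·x + b whose bias has been pushed far negative. The weight, bias, inputs, and 0.01 leaky slope are all assumed values chosen for illustration. It compares the gradient of the activation output with respect to w under ReLU and under Leaky ReLU.

```python
def relu_grad(z):
    # Derivative of max(0, z) with respect to z.
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, slope=0.01):
    # slope=0.01 is an assumed value for the negative side.
    return 1.0 if z > 0 else slope

# Assumed neuron state: the bias is so negative that z < 0 for every input.
w, b = 0.5, -10.0
inputs = [-2.0, -1.0, 0.5, 1.0, 3.0]

def weight_grads(act_grad):
    # Chain rule: d activation(w*x + b) / dw = activation'(z) * x
    return [act_grad(w * x + b) * x for x in inputs]

relu_grads = weight_grads(relu_grad)
leaky_grads = weight_grads(leaky_relu_grad)

print(all(g == 0.0 for g in relu_grads))  # True
print(leaky_grads)
```

Every ReLU gradient is zero, so no gradient step can move w or b back into the active region. The Leaky ReLU gradients are small but nonzero, so training can still pull the neuron back.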

Key idea

ReLU is the sensible default hidden activation; switch to Leaky ReLU or GELU if neurons die. Match the output activation to the task: softmax for multiclass classification, sigmoid for binary classification.
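Matching the output activation to the task can be sketched as follows. The logits are made-up numbers, and the max-subtraction inside `softmax` is a standard numerical-stability trick rather than something this lesson specifies.

```python
import math

def softmax(logits):
    """Turn a list of scores into probabilities that sum to one."""
    m = max(logits)  # subtract the max so math.exp cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Turn a single score into a probability for the positive class."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Multiclass: one logit per class, softmax over all of them.
class_probs = softmax([2.0, 1.0, 0.1])

# Binary: a single logit, sigmoid gives P(positive).
positive_prob = sigmoid(0.0)

print(class_probs)
print(positive_prob)  # 0.5
```

Softmax couples the outputs so they compete for probability mass, which suits mutually exclusive classes. A single sigmoid output is enough when there are only two outcomes.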

Check yourself


1. What is the dying ReLU problem?

2. Why does a network of stacked linear layers gain nothing from depth without activations?