The Activation Function Choice

Why nonlinearity is essential

Stacking linear layers only produces another linear function. An activation function inserts a nonlinearity between layers so the network can model curves, boundaries, and interactions.
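To make the collapse concrete, here is a minimal sketch in Python with NumPy. The weights, biases, and function names are made up for illustration. It shows that two linear layers with nothing between them equal one linear layer, and that a ReLU between them breaks that equivalence.

```python
import numpy as np

# Hypothetical weights and biases, chosen only for illustration.
W1 = np.array([[1.0, -1.0], [2.0, 0.5]])
b1 = np.array([0.5, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.0])

def two_linear_layers(x):
    # Layer 2 applied directly to layer 1, with no activation in between.
    return W2 @ (W1 @ x + b1) + b2

# The same map folded into a single weight matrix and bias.
W = W2 @ W1
b = W2 @ b1 + b2

def one_linear_layer(x):
    return W @ x + b

def with_relu(x):
    # Same two layers, but with a ReLU between them.
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = np.array([1.0, 2.0])
collapsed = np.allclose(two_linear_layers(x), one_linear_layer(x))
relu_differs = not np.allclose(with_relu(x), one_linear_layer(x))
print(collapsed, relu_differs)
```

For this input the first hidden unit has a negative pre-activation, so the ReLU zeroes it and the output no longer matches the folded linear layer.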

The common choices

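Sigmoid squashes its input into the range (0, 1) but saturates for large positive or negative inputs, where its gradient approaches zero; in deep stacks this causes vanishing gradients.
Tanh squashes its input into (-1, 1), so its output is zero centered and it is slightly better behaved than sigmoid, but it still saturates at both ends.
ReLU outputs the input if positive, else zero. It is cheap and does not saturate on the positive side, making it the default hidden activation.
Leaky ReLU and GELU soften the dead zone: Leaky ReLU keeps a small constant slope for negative inputs, and GELU is a smooth curve that still passes a small gradient for moderately negative inputs.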
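The five functions above can be sketched in a few lines of NumPy. This is an illustrative version, not a library implementation: the function names, the 0.01 leaky slope, and the choice of the exact erf-based GELU (rather than the common tanh approximation) are assumptions made here. The sigmoid gradient is included to show saturation numerically.

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    # slope=0.01 is an assumed default; it is a tunable hyperparameter.
    return np.where(x > 0, x, slope * x)

# math.erf works on scalars, so vectorize it for array inputs.
_erf = np.vectorize(math.erf)

def gelu(x):
    # Exact form: x times the standard normal CDF of x.
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))
print(leaky_relu(x))
print(gelu(x))
# Saturation: the sigmoid gradient peaks at 0 and shrinks quickly away from it.
print(sigmoid_grad(np.array([0.0, 10.0])))
```

The sigmoid gradient is 0.25 at zero and already below 1e-4 at an input of 10, which is the saturation that starves deep stacks of gradient.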

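Picking one

Start with ReLU for hidden layers, and move to Leaky ReLU or GELU if neurons die during training. Output layers are special: use softmax for multiclass classification and sigmoid for binary classification.

The dying ReLU problem

A ReLU neuron whose pre-activation is negative for every input outputs zero everywhere, so its gradient is zero, its weights stop updating, and it never recovers. Leaky ReLU keeps a small slope for negative inputs, and GELU a small nonzero gradient, which guards against this.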
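A dead neuron can be sketched numerically. The example below is hypothetical: the inputs, weights, bias of -5, and upstream gradient of 1 per example are all invented so that the pre-activation is negative on every input. It compares the weight gradient that flows back through ReLU with the one that flows back through Leaky ReLU (assumed slope 0.01).

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU with respect to its input: 1 where z > 0, else 0.
    return (z > 0).astype(float)

def leaky_relu_grad(z, slope=0.01):
    # Derivative of Leaky ReLU: 1 where z > 0, else the small slope.
    return np.where(z > 0, 1.0, slope)

# Hypothetical batch of three inputs and one neuron with a large negative bias.
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.1, 0.9]])
w = np.array([0.3, 0.4])
b = -5.0

z = X @ w + b                # pre-activations, negative for every example
upstream = np.ones_like(z)   # assume dLoss/dOutput = 1 for each example

# Chain rule: gradient w.r.t. w sums (upstream * activation slope) * input.
grad_w_relu = (upstream * relu_grad(z)) @ X
grad_w_leaky = (upstream * leaky_relu_grad(z)) @ X
print(z, grad_w_relu, grad_w_leaky)
```

With ReLU the weight gradient is exactly zero, so no update can move the neuron back into its active range; with the leaky variant a small gradient survives.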

Key idea

ReLU is the sensible default hidden activation; switch to Leaky ReLU or GELU if neurons die. Match the output activation to the task: softmax for multiclass, sigmoid for binary.
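The output-layer pairing can be sketched as follows. The logits are made-up values and the function names are illustrative. Subtracting the maximum before exponentiating is a common numerical-stability step added here; it does not change the result.

```python
import numpy as np

def softmax(logits):
    # Shift by the max so np.exp never sees a large positive number.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def sigmoid(logit):
    return 1.0 / (1.0 + np.exp(-logit))

# Multiclass: one logit per class, turned into probabilities that sum to 1.
class_probs = softmax(np.array([2.0, 1.0, 0.1]))

# Binary: a single logit, turned into the probability of the positive class.
positive_prob = sigmoid(0.0)

print(class_probs, class_probs.sum(), positive_prob)
```

Softmax makes the classes compete for a shared probability mass, while sigmoid scores a single yes/no outcome on its own.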
