Why nonlinearity is essential
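Stacking linear layers without anything in between only produces another linear function, because a composition of linear maps is itself linear. An activation function inserts a nonlinearity between layers so the network can model curves, nonlinear decision boundaries, and interactions between features.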
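The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1` and `W2`, the input `x`, and the helper names are made up for illustration, and biases are left out for brevity.

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Two linear layers applied one after the other...
two_layers = matvec(W2, matvec(W1, x))
# ...give the same result as one layer with weights W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)
# With a ReLU in between, the single matrix no longer reproduces the output.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, collapsed, with_relu)
```

Here the first hidden value is negative, so ReLU zeroes it and the result departs from what any single matrix `W2 @ W1` would give for this input.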
The common choices
- Sigmoid squashes its input into the range between zero and one but saturates for large positive or negative inputs, causing vanishing gradients in deep stacks.
- Tanh is zero-centered, with outputs between minus one and one, and is slightly better behaved than sigmoid, but it still saturates.
- ReLU outputs the input if positive, else zero. It is cheap and avoids saturation on the positive side, making it the default hidden activation.
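- Leaky ReLU and GELU soften the dead zone: Leaky ReLU gives negative inputs a small constant slope, and GELU is a smooth curve that still passes a nonzero gradient for moderately negative inputs.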
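The activations listed above and their derivatives can be sketched in plain Python to make saturation visible. This is an illustrative sketch, not a library implementation: the function names are chosen here, the Leaky ReLU slope of 0.01 is an assumed common setting, and GELU is written in its exact form, x times the standard normal CDF.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, shrinks toward 0 for large |x|

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0, also shrinks for large |x|

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant 1 on the positive side, 0 otherwise

def leaky_relu(x, slope=0.01):    # slope value is an assumption
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # Derivative of x * Phi(x) is Phi(x) + x * phi(x).
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}  "
          f"relu'={relu_grad(x):.2f}  leaky'={leaky_relu_grad(x):.2f}  "
          f"gelu'={gelu_grad(x):.4f}")
```

At the extremes the sigmoid and tanh derivatives become very small, while the ReLU derivative stays at one for positive inputs. For negative inputs ReLU gives exactly zero, whereas Leaky ReLU and GELU do not.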
Picking one
The dying ReLU problem
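A ReLU neuron whose pre-activation is negative for every input outputs zero everywhere, so its gradient is zero, its weights stop updating, and it typically never recovers. Leaky ReLU keeps a small slope for negative inputs, and GELU keeps a nonzero gradient for moderately negative ones, which lets such a neuron keep learning. Output layers are special: use softmax for multiclass classification and sigmoid for binary classification.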
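Both points above, a dead ReLU unit and the choice of output activation, can be sketched in a few lines of plain Python. Everything concrete here is an assumption made for illustration: the single neuron `z = w * x + b`, its weight and bias values, the inputs, the upstream gradient of 1.0, the Leaky ReLU slope of 0.01, and the example logits.

```python
import math

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, slope=0.01):  # slope value is an assumption
    return 1.0 if z > 0 else slope

# Part 1: a hypothetical neuron whose large negative bias keeps z < 0
# for every input in this small made-up dataset.
w, b = 0.5, -10.0
inputs = [-2.0, 0.0, 1.0, 3.0]
upstream = 1.0  # assumed gradient of the loss w.r.t. the neuron's output

pre_activations = [w * x + b for x in inputs]

# Gradient of the loss w.r.t. the bias is upstream * activation'(z).
relu_bias_grads = [upstream * relu_grad(z) for z in pre_activations]
leaky_bias_grads = [upstream * leaky_relu_grad(z) for z in pre_activations]

print(relu_bias_grads)   # all zero: the bias can never move back up
print(leaky_bias_grads)  # small but nonzero: the neuron can still recover

# Part 2: output activations.
def softmax(logits):
    """Multiclass: turn logits into probabilities that sum to one."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Binary: turn a single logit into a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-z))

class_probs = softmax([2.0, 1.0, 0.1])
positive_prob = sigmoid(0.0)
print(class_probs, positive_prob)
```

In the first part every ReLU gradient is zero, so no update can reach the neuron, while the leaky variant still passes a small signal. In the second part the softmax outputs form a distribution over the classes, and the sigmoid maps a single logit to one probability.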
Key idea
ReLU is the sensible default hidden activation; switch to Leaky ReLU or GELU if neurons die. Match the output activation to the task: softmax for multiclass classification, sigmoid for binary classification.