What it is
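An activation function adds nonlinearity so a network can model curves, not just straight lines; without one, a stack of linear layers collapses into a single linear map. The rectified linear unit, or ReLU, outputs the input when it is positive and zero otherwise, that is, f(x) = max(0, x). Despite its simplicity, it became a default choice for hidden layers in deep networks.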
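As a minimal sketch of the definition above, ReLU can be written in plain Python as follows. The function name `relu` and the sample inputs are illustrative choices, not taken from any particular library.

```python
def relu(x: float) -> float:
    """Return x when it is positive, otherwise 0.0."""
    return max(0.0, x)


# Illustrative inputs: two negatives, zero, and two positives.
inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

Negative inputs and zero map to zero, while positive inputs pass through unchanged.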
Why ReLU works
- It is cheap to compute, just a comparison with zero
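- For positive inputs its gradient is a constant one, so gradients pass through active units unscaled. This mitigates the vanishing gradient that plagued sigmoid and tanh in deep stacks, where the derivative shrinks toward zero as units saturate (the sigmoid derivative never exceeds 0.25)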
- It produces sparse activations since many units output exactly zero
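The gradient and sparsity points above can be illustrated with a rough sketch. It makes simplifying assumptions: weights are ignored, every layer sees the same pre-activation, and the helper `chained_gradient`, the depth of 20, and the sample values are all hypothetical choices for illustration only.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, s * (1 - s), which peaks at 0.25 when x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def chained_gradient(local_grad, pre_activation: float, depth: int) -> float:
    """Multiply the same local derivative across `depth` layers (weights ignored)."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad(pre_activation)
    return g


depth = 20
# Sigmoid at its best case (x = 0) still scales the gradient by 0.25 per layer.
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)
# An active ReLU unit (x > 0) scales the gradient by exactly 1 per layer.
relu_chain = chained_gradient(relu_grad, 1.0, depth)
print(sigmoid_chain)  # about 9.1e-13
print(relu_chain)     # 1.0

# Sparsity: every non-positive pre-activation yields an exact zero.
pre_activations = [-1.2, 0.4, -0.3, 2.0, -0.8, 0.1]
activations = [relu(x) for x in pre_activations]
sparsity = activations.count(0.0) / len(activations)
print(sparsity)  # 0.5
```

Real networks also multiply by weight matrices at each layer, so this only isolates the contribution of the activation's derivative.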
The dying ReLU problem
A neuron whose pre-activation is negative for every input always outputs zero and receives a zero gradient, so its weights stop updating and it typically never recovers. Such a neuron is called a dead unit. Variants address this:
- Leaky ReLU lets a small slope through for negative inputs so the gradient is never exactly zero
- GELU and SiLU are smooth curves used in modern transformers; they keep a small nonzero gradient for moderately negative inputs and are often reported to train slightly better than ReLU
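To make the comparison concrete, here is a minimal sketch of the three variants next to ReLU, with a numerical estimate of each gradient at a negative input. The `negative_slope` value of 0.01 is an assumed, commonly used setting rather than something the text specifies; GELU is written in its exact erf form (a tanh approximation is also in use), and `numeric_grad` is a hypothetical helper using central differences.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Pass positives unchanged; scale negatives by a small slope (0.01 assumed)."""
    return x if x > 0 else negative_slope * x


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """GELU in its exact form: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def numeric_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


x = -2.0  # a negative pre-activation, where plain ReLU has zero gradient
activations = {"relu": relu, "leaky_relu": leaky_relu, "silu": silu, "gelu": gelu}
grads = {name: numeric_grad(f, x) for name, f in activations.items()}
for name, g in grads.items():
    print(f"{name}: {g:.4f}")
```

At x = -2 the ReLU gradient is exactly zero, the leaky ReLU gradient equals the chosen slope, and the SiLU and GELU gradients are small but nonzero (and negative, since both curves dip below zero before flattening). For very negative inputs the SiLU and GELU gradients also approach zero, so only leaky ReLU keeps a fixed nonzero slope everywhere.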
Key idea
ReLU passes positive inputs unchanged and zeroes the rest, giving cheap nonlinearity and strong gradients, while variants like leaky ReLU and GELU reduce the risk of dead units by keeping some gradient for negative inputs.