ReLU And Its Variants

What it is

An activation function adds nonlinearity so a network can model curves, not just straight lines. The rectified linear unit, or ReLU, outputs the input when it is positive and zero otherwise: f(x) = max(0, x). Despite its simplicity, it became a common default for hidden layers in deep networks.
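
As a minimal sketch of the definition above, ReLU fits in one line of plain Python. The function name relu and the sample inputs are illustrative choices, not taken from any particular library.

```python
def relu(x: float) -> float:
    """Return x when it is positive, otherwise 0.0."""
    return max(0.0, x)


# Illustrative sample inputs covering negative, zero, and positive values.
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

Negative inputs and zero map to zero, while positive inputs pass through unchanged.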

Why ReLU works

It is cheap to compute, just a comparison with zero
For positive inputs its gradient is a constant one, so the signal does not shrink as it passes back through the activation, unlike the saturating sigmoid and tanh, whose small derivatives cause vanishing gradients in deep stacks
It produces sparse activations since many units output exactly zero
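
The gradient and sparsity points can be illustrated with a rough sketch. It is deliberately simplified: it multiplies only the activation derivatives across a stack of layers and ignores the weights, so it shows the trend rather than a real backward pass. The depth of 20 and the sample pre-activation values are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


depth = 20  # assumed number of stacked layers

# Best case for sigmoid: every layer sits at x = 0, where its slope is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU on an active (positive) path contributes a factor of exactly one per layer.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain)  # about 9.1e-13
print(relu_chain)     # 1.0

# Sparsity: made-up pre-activations, half of them negative.
pre_activations = [-1.2, 0.3, -0.7, 2.1, -0.1, 0.9]
activations = [max(0.0, z) for z in pre_activations]
sparsity = sum(1 for a in activations if a == 0.0) / len(activations)
print(sparsity)  # 0.5
```

Even in its best case the sigmoid factor collapses toward zero over 20 layers, while the ReLU factor on an active path stays at one.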

The dying ReLU problem

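A neuron whose input to ReLU is negative for every training example always outputs zero, so it receives a zero gradient, its weights stop updating, and it may never recover. Such a neuron is called a dead unit. Variants address this: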

Leaky ReLU lets a small slope through for negative inputs so the gradient is never exactly zero
GELU and SiLU are smooth alternatives, common in modern transformers, that often train slightly better
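
The variants can be sketched in plain Python to show how each treats a negative input. The negative_slope value of 0.01 is an assumed, commonly used default rather than something fixed by the text, and GELU is written in its exact form using the Gaussian error function.

```python
import math


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # negative_slope is an assumed default; it is a tunable hyperparameter.
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    return 1.0 if x > 0 else negative_slope


def gelu(x: float) -> float:
    # x times the standard normal CDF of x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    # x times sigmoid(x).
    return x / (1.0 + math.exp(-x))


x = -1.0
print(relu_grad(x))        # 0.0, no learning signal
print(leaky_relu_grad(x))  # 0.01, small but nonzero
print(round(gelu(x), 4))   # -0.1587
print(round(silu(x), 4))   # -0.2689
```

At x = -1 plain ReLU passes back no gradient, leaky ReLU passes a small one, and the smooth GELU and SiLU curves produce small negative outputs instead of a hard zero.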

Key idea

ReLU passes positive inputs unchanged and zeroes the rest, giving cheap nonlinearity and strong gradients, while variants like leaky ReLU and GELU reduce the risk of dead units.
