Machine Learning

ReLU And Its Variants

The activation that made deep networks trainable, and its modern cousins.

What it is

An activation function adds nonlinearity so a network can model curves, not just straight lines. The rectified linear unit, or ReLU, outputs its input when the input is positive and zero otherwise: ReLU(x) = max(0, x). Despite its simplicity, it became a standard default activation for deep networks.
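The definition above can be written in a couple of lines. This is a minimal scalar sketch in plain Python; the function name `relu` and the sample inputs are chosen here for illustration, and real frameworks apply the same rule elementwise to whole tensors.

```python
def relu(x: float) -> float:
    """Return x when x is positive, otherwise 0.0."""
    return max(0.0, x)


# Illustrative inputs (not from the lesson): two negatives, zero, two positives.
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

Negative inputs and zero all map to 0.0, while positive inputs pass through unchanged.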

Why ReLU works

  • It is cheap to compute, just a comparison with zero
  • For positive inputs its gradient is a constant one, so active units do not shrink the backpropagated signal the way saturating sigmoid and tanh units do in deep stacks, which eases the vanishing gradient problem
  • It produces sparse activations since many units output exactly zero
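The gradient and sparsity points above can be checked with a little arithmetic. The sketch below is a simplification: it multiplies only the activation derivatives along a chain of layers and ignores the weight matrices, and the depth of 20 and the sample pre-activations are hypothetical values picked for illustration.

```python
import math


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid, s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


depth = 20  # hypothetical number of stacked layers

# Best case for sigmoid: its derivative peaks at 0.25 when x = 0.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# An active ReLU unit contributes a factor of exactly 1 at every layer.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain)  # about 9.1e-13
print(relu_chain)     # 1.0

# Sparsity: every non-positive pre-activation becomes exactly zero.
pre_activations = [-1.2, 0.7, -0.3, 2.1, -0.8, 0.0]  # made-up values
activations = [max(0.0, z) for z in pre_activations]
zero_fraction = sum(1 for a in activations if a == 0.0) / len(activations)
print(zero_fraction)  # 4 of 6 units are zero
```

Even in the sigmoid's best case, the activation factor alone has shrunk the signal by roughly twelve orders of magnitude after 20 layers, while the chain of active ReLU units leaves it untouched.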

The dying ReLU problem

A neuron whose input is negative for every training example always outputs zero, so it receives a zero gradient, its weights stop updating, and it rarely recovers. Such a neuron is called a dead unit. Variants address this:

  • Leaky ReLU applies a small slope (commonly 0.01) to negative inputs so the gradient is never exactly zero
  • GELU and SiLU are smooth curves, common in modern transformers, that taper toward zero for negative inputs instead of cutting off sharply and often train slightly better
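A small sketch can make the dead unit and its remedies concrete. It assumes a single hypothetical neuron computing z = w * x + b with a large negative bias, and uses a leaky slope of 0.01, a common default rather than anything the lesson prescribes. GELU is written in its exact form with the error function, and SiLU as x times the sigmoid of x.

```python
import math


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Pass positives unchanged; scale negatives by a small slope alpha."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    """Derivative of leaky ReLU: 1 for positives, alpha otherwise."""
    return 1.0 if z > 0 else alpha


def gelu(z: float) -> float:
    """Exact GELU: z times the standard normal CDF of z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z: float) -> float:
    """SiLU (also called swish): z times sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


# Hypothetical neuron whose bias pushes every pre-activation below zero.
w, b = 1.0, -10.0
inputs = [0.0, 0.25, 0.5, 1.0]
pre_activations = [w * x + b for x in inputs]

relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]
print(relu_grads)   # [0.0, 0.0, 0.0, 0.0]  no learning signal at all
print(leaky_grads)  # [0.01, 0.01, 0.01, 0.01]  small but nonzero

# The smooth variants give small negative outputs instead of a hard zero.
print(round(gelu(-1.0), 4))  # -0.1587
print(round(silu(-1.0), 4))  # -0.2689
```

With plain ReLU the gradient is zero on every input, so nothing can pull the bias back up. The leaky variant keeps a small gradient flowing, which gives training a path to revive the unit.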

Key idea

ReLU passes positive inputs unchanged and zeroes the rest, giving cheap nonlinearity and gradients that do not shrink for active units, while variants like leaky ReLU and GELU reduce the risk of dead units.

Check yourself

1. What does ReLU output for a negative input?

2. What problem does leaky ReLU address?