Machine Learning

Activation Functions: ReLU and GELU

Nonlinear gates shape what flows forward through a network.

What activations do

An activation function applies a nonlinearity elementwise so the network can model complex relationships rather than just linear ones.
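
To see why the nonlinearity matters, here is a minimal sketch using scalar "layers". The weights `W1`, `B1`, `W2`, `B2` and the helper names are made up for illustration. Two stacked linear layers collapse into a single linear layer, while putting a ReLU between them produces a function that no single linear layer can match.

```python
# Illustrative scalar "layers"; all weights and names here are hypothetical.

def linear(w: float, b: float, x: float) -> float:
    """A one-dimensional linear layer: w * x + b."""
    return w * x + b


def relu(x: float) -> float:
    """Pass positive inputs through, zero out the rest."""
    return max(0.0, x)


W1, B1 = 2.0, -1.0
W2, B2 = -3.0, 0.5


def two_linear(x: float) -> float:
    """Two linear layers stacked with no activation in between."""
    return linear(W2, B2, linear(W1, B1, x))


def collapsed(x: float) -> float:
    """The single linear layer that two_linear is equivalent to."""
    return linear(W2 * W1, W2 * B1 + B2, x)


def with_relu(x: float) -> float:
    """The same two layers with a ReLU between them."""
    return linear(W2, B2, relu(linear(W1, B1, x)))


if __name__ == "__main__":
    for x in (-2.0, 0.0, 0.5, 3.0):
        print(x, two_linear(x), collapsed(x), with_relu(x))
```

In this sketch `two_linear` and `collapsed` agree at every input, whereas `with_relu` is flat for small inputs and then slopes downward, a shape a single line cannot produce.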

ReLU

ReLU outputs the input when it is positive and zero otherwise: ReLU(x) = max(0, x).

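  • It is cheap to compute, and its gradient is exactly 1 for positive inputs, so it avoids the saturation that plagued sigmoid and tanh.
  • A unit whose input stays negative receives zero gradient and can get stuck at zero (a dead unit); variants like leaky ReLU address this by keeping a small slope for negative inputs.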
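
The behavior described above can be sketched in a few lines of plain Python. The function names and the default `negative_slope=0.01` are illustrative choices, not taken from any particular library.

```python
def relu(x: float) -> float:
    """ReLU: return x for positive inputs, 0 otherwise."""
    return x if x > 0.0 else 0.0


def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise.

    The derivative at exactly 0 is undefined; using 0 there is a convention.
    """
    return 1.0 if x > 0.0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: keep a small slope for negative inputs.

    negative_slope=0.01 is an assumed default chosen for illustration.
    """
    return x if x > 0.0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Gradient of leaky ReLU: never exactly zero when negative_slope > 0."""
    return 1.0 if x > 0.0 else negative_slope


if __name__ == "__main__":
    for x in (-2.0, 0.0, 3.0):
        print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, `relu_grad` is zero, so no learning signal flows back through that unit; `leaky_relu_grad` stays small but nonzero, which is how the leaky variant lets a stuck unit recover.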

GELU

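GELU is a smooth activation that multiplies an input x by Φ(x), the probability that a standard Gaussian variable is at most x: GELU(x) = x · Φ(x). Large positive inputs pass through almost unchanged, large negative inputs are pushed toward zero, and the curve bends softly near zero instead of kinking.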

  • It is differentiable everywhere and tends to work well in transformers.
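
As a rough sketch, GELU can be written with the standard library's `math.erf`, since the standard normal CDF is Φ(x) = 0.5 · (1 + erf(x / √2)). The second function is the widely used tanh approximation; the function names are illustrative.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, gelu(x), gelu_tanh(x))
```

Unlike ReLU, GELU lets slightly negative inputs through as small negative outputs rather than clamping them to exactly zero, which is the soft gating described above.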

The right choice balances simplicity, gradient behavior, and empirical accuracy for the architecture at hand.

Key idea

ReLU is a cheap hard gate that zeros negatives, while GELU is a smooth probabilistic gate favored in transformers, both supplying the nonlinearity learning needs.

Check yourself

1. What does ReLU output for a negative input?

2. What characterizes GELU compared to ReLU?