Activation Functions: ReLU and GELU

What activations do

An activation function applies a nonlinearity elementwise to a layer's output. Without it, any stack of linear layers collapses into a single linear map, so the network could model only linear relationships no matter how deep it is.
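To make the role of the nonlinearity concrete, here is a minimal sketch using scalar "layers". The weights, the helper names (`affine`, `stacked_linear`, `stacked_relu`, `is_affine_on`), and the test points are illustrative assumptions, not part of the lesson. It checks an identity that every affine function satisfies and shows that inserting ReLU between the layers breaks it.

```python
def affine(w, b):
    """Return a scalar 'layer' computing x -> w * x + b (hypothetical helper)."""
    return lambda x: w * x + b

def relu(x):
    """ReLU on a single float."""
    return max(0.0, x)

# Two arbitrary layers chosen for illustration.
layer1 = affine(2.0, -1.0)
layer2 = affine(-3.0, 0.5)

def stacked_linear(x):
    """Two layers with no activation in between: still an affine function."""
    return layer2(layer1(x))

def stacked_relu(x):
    """The same two layers with ReLU in between: no longer affine."""
    return layer2(relu(layer1(x)))

def is_affine_on(f, a, b):
    """Check f(a) + f(b) == f(a + b) + f(0), which holds for every affine f."""
    return abs(f(a) + f(b) - f(a + b) - f(0.0)) < 1e-9

print(is_affine_on(stacked_linear, 1.0, -1.0))  # identity holds
print(is_affine_on(stacked_relu, 1.0, -1.0))    # identity fails
```

Passing the check at one pair of points does not prove a function is affine, but failing it at any pair proves it is not, which is all the second call needs to show.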

ReLU

ReLU outputs the input when it is positive and zero otherwise: ReLU(x) = max(0, x).

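It is cheap to compute, and because its gradient is exactly 1 for positive inputs, it avoids the saturation that plagued sigmoid and tanh.
It can, however, produce dead units: a unit whose input is negative for every example outputs zero, receives zero gradient, and stops updating. Variants like leaky ReLU address this by giving negative inputs a small nonzero slope.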
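The behavior above can be sketched in a few lines of plain Python. The function names and the default `negative_slope=0.01` are assumptions chosen for illustration, and the gradient at exactly zero is set to the negative-side value by convention.

```python
def relu(x):
    """ReLU: pass positive inputs through, zero out the rest."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: negative inputs are scaled by a small slope, not zeroed."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient of leaky ReLU: never exactly zero when negative_slope > 0."""
    return 1.0 if x > 0 else negative_slope

for x in (-2.0, 0.5):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For the negative input, `relu_grad` returns zero, so no learning signal flows through that unit. `leaky_relu_grad` returns the small slope instead, which is what lets a unit recover from the dead state.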

GELU

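GELU is a smooth activation that weights each input by the probability that a standard Gaussian variable falls below it: GELU(x) = x * Phi(x), where Phi is the standard normal CDF. The result is a soft gate that curves smoothly through zero instead of switching abruptly.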

Unlike ReLU, which has a kink at zero, it is differentiable everywhere. It tends to work well in transformers and is the activation used in models such as BERT and GPT-2.
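As a minimal sketch, GELU can be written with the standard definition GELU(x) = x * Phi(x), where Phi is the standard normal CDF, computed here with `math.erf`. The second function is the widely used tanh-based approximation. The function names `gelu` and `gelu_tanh` are chosen here for illustration.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.6f}  gelu_tanh={gelu_tanh(x):+.6f}")
```

Large positive inputs pass through almost unchanged and large negative inputs are pushed toward zero, as with ReLU. Near zero the output changes gradually, and small negative inputs yield small negative outputs rather than exactly zero.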

The right choice balances simplicity, gradient behavior, and empirical accuracy for the architecture at hand.

Key idea

ReLU is a cheap hard gate that zeros negatives, while GELU is a smooth probabilistic gate favored in transformers. Both supply the nonlinearity that learning needs.

Check yourself