What activations do
An activation function applies a nonlinearity elementwise so the network can model complex relationships rather than just linear ones. Without it, any stack of linear layers collapses into a single linear map, no matter how deep.
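To see why the nonlinearity matters, here is a minimal sketch using scalar "layers". The weights `w1`, `b1`, `w2`, `b2` and the helper names are illustrative assumptions, not taken from any particular model. Two affine layers with nothing between them reduce to one affine layer, while a ReLU between them produces a function that is no longer affine.

```python
def linear(w, b, x):
    """Scalar affine layer: w * x + b."""
    return w * x + b


def relu(x):
    """Pass positive inputs through, zero out the rest."""
    return x if x > 0 else 0.0


# Hypothetical weights, chosen only for illustration.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5


def stacked_linear(x):
    """Two affine layers with no activation between them."""
    return linear(w2, b2, linear(w1, b1, x))


def collapsed(x):
    """The single affine layer equal to stacked_linear:
    w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)."""
    return (w2 * w1) * x + (w2 * b1 + b2)


def stacked_with_relu(x):
    """Same two layers, with a ReLU inserted between them."""
    return linear(w2, b2, relu(linear(w1, b1, x)))


def midpoint_gap(f, a, b):
    """Zero for every affine f; nonzero here signals a nonlinearity."""
    return f(a) + f(b) - 2.0 * f((a + b) / 2.0)
```

For any affine function, `midpoint_gap` is zero for every pair of points. It is zero for `stacked_linear` but not for `stacked_with_relu` when the two points straddle the kink, which is the extra expressiveness the activation buys.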
ReLU
ReLU outputs the input when positive and zero otherwise.
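- It is cheap to compute, and unlike sigmoid and tanh it does not saturate for positive inputs, so gradients there pass through unscaled.
- A unit whose input stays negative outputs zero and receives zero gradient, so it can get stuck ("dead"); variants like leaky ReLU address this with a small nonzero slope for negative inputs.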
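The two points above can be sketched in a few lines. This is a scalar illustration only. The `negative_slope=0.01` default is an assumed, commonly used value, and taking the gradient at exactly zero to be the negative-side value is a convention chosen here.

```python
def relu(x):
    """ReLU: the input when positive, zero otherwise."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU; 0 at x == 0 by the convention assumed here."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: like ReLU, but negatives are scaled rather than zeroed."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of leaky ReLU; never exactly zero when negative_slope > 0."""
    return 1.0 if x > 0 else negative_slope
```

For a negative input, `relu_grad` returns zero, so no learning signal flows back through that unit, whereas `leaky_relu_grad` returns the small slope and the unit can still recover.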
GELU
GELU is a smooth activation that multiplies an input x by Φ(x), the standard Gaussian CDF, that is, the probability that a standard normal variable falls below x. The result is a soft gated curve near zero.
- It is differentiable everywhere and tends to work well in transformers.
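As a rough sketch, the exact form GELU(x) = x · Φ(x), with Φ the standard normal CDF, can be written with only the standard library. The function names are illustrative; deep learning frameworks ship their own implementations, often using a faster approximation.

```python
import math


def standard_normal_cdf(x):
    """Phi(x): probability that a standard normal variable is <= x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x):
    """Exact GELU: the input scaled by the Gaussian CDF at that input."""
    return x * standard_normal_cdf(x)


def relu(x):
    """ReLU, included only for comparison."""
    return x if x > 0 else 0.0


if __name__ == "__main__":
    # Far from zero GELU tracks ReLU; near zero it bends smoothly instead.
    for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0):
        print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}")
```

Unlike ReLU, small negative inputs yield small negative outputs rather than exactly zero, so the gate is soft rather than hard.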
The right choice balances simplicity, gradient behavior, and empirical accuracy for the architecture at hand.
Key idea
ReLU is a cheap hard gate that zeros negatives, while GELU is a smooth probabilistic gate favored in transformers; both supply the nonlinearity a network needs to learn more than linear functions.