Why regularize
A model that fits training noise perfectly often generalizes poorly. Regularization adds a penalty on the size of the parameters to the loss, discouraging overly complex solutions.
L2 regularization
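L2, also called ridge, adds the sum of squared weights, scaled by a penalty coefficient, to the loss. It shrinks all weights smoothly toward zero but rarely makes any exactly zero. This makes it effective at reducing variance, and many libraries apply it by default. It is often called weight decay; the two coincide for plain gradient descent but can differ for adaptive optimizers such as Adam.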
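To make the smooth shrinkage concrete, here is a minimal sketch on a one-weight toy problem. It assumes the unpenalized optimum is a and the penalized objective is 0.5*(w - a)^2 + 0.5*lam*w^2. The function name ridge_1d and the symbols a and lam are illustrative, not taken from any library.

```python
def ridge_1d(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2 over w (toy one-weight problem)."""
    # Setting the derivative (w - a) + lam*w to zero gives w = a / (1 + lam).
    return a / (1.0 + lam)

# Hypothetical unpenalized weights, chosen only for illustration.
unpenalized = [3.0, -0.5, 0.1]
shrunk = [ridge_1d(a, lam=1.0) for a in unpenalized]
print(shrunk)  # [1.5, -0.25, 0.05]
```

Every weight is scaled by the same factor 1 / (1 + lam), so small weights get smaller but none becomes exactly zero unless it started there.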
L1 regularization
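L1, also called lasso, adds the sum of absolute weights. Because the absolute value has a corner at zero, the penalty pulls with the same strength no matter how small a weight gets, so some weights land exactly at zero. The result is a sparse model that effectively performs feature selection.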
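The exact zeros can be seen on a one-weight toy problem. The sketch below assumes the unpenalized optimum is a and the penalized objective is 0.5*(w - a)^2 + lam*|w|, whose minimizer is the well-known soft-thresholding rule. The names soft_threshold, a, and lam are illustrative.

```python
def soft_threshold(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w) over w (toy one-weight problem)."""
    if a > lam:
        return a - lam   # large positive weight: shifted down by lam
    if a < -lam:
        return a + lam   # large negative weight: shifted up by lam
    return 0.0           # anything within [-lam, lam] is set exactly to zero

# Hypothetical unpenalized weights, chosen only for illustration.
unpenalized = [3.0, -0.5, 0.1]
sparse = [soft_threshold(a, lam=1.0) for a in unpenalized]
print(sparse)  # [2.0, 0.0, 0.0]
```

Weights whose magnitude is at most lam are removed entirely, which is the feature-selection effect; the surviving weight is shifted by a constant rather than rescaled.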
Choosing
- Use L2 when you expect many small contributions from all features
- Use L1 when you suspect only a few features truly matter
- Use elastic net, which blends both penalties, when you want some sparsity but also stable shrinkage
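As a rough illustration of how the blend works, the sketch below computes a combined penalty. The parameter names alpha (overall strength) and l1_ratio (share given to L1) follow the convention used by scikit-learn's ElasticNet; they are assumptions, not something these notes define, and other libraries parameterize the mix differently.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Blend of L1 and L2 penalties; l1_ratio=1 is pure L1, l1_ratio=0 is pure L2."""
    l1 = sum(abs(w) for w in weights)   # sum of absolute weights
    l2 = sum(w * w for w in weights)    # sum of squared weights
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)

# Hypothetical weight vector, chosen only for illustration.
w = [2.0, -1.0, 0.0]
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0))  # 3.0  (pure L1)
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0))  # 2.5  (pure L2)
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5))  # 2.75 (even blend)
```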
The strength knob
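A coefficient, often written lambda or alpha, controls how strong the penalty is. If it is too strong the model underfits; if it is too weak the model overfits. The coefficient is usually tuned with cross-validation: try several values and keep the one with the lowest validation error.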
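The tuning loop might look like the following sketch. It uses a one-feature ridge fit without an intercept and synthetic data so that it runs with the standard library alone; the helper names, the grid of candidate values, and the data are all hypothetical. In practice a library routine would normally do this work.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)**2) + lam*w**2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, k=5):
    """Mean squared validation error, averaged over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        w = fit_ridge_1d(train_x, train_y, lam)
        errs = [(ys[i] - w * xs[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Synthetic data: true slope 2.0 plus Gaussian noise (illustrative only).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Candidate strengths on a roughly logarithmic grid.
grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

The largest value in the grid shrinks the slope so far that validation error rises sharply, which is the underfitting end of the tradeoff; the chosen value is whichever candidate scores lowest on held-out data.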
Key idea
L2 shrinks weights smoothly while L1 drives some to exactly zero, and both trade a little training fit for better generalization.