L1 and L2 Regularization

Why regularize

A model flexible enough to fit the noise in its training data often generalizes poorly. Regularization adds to the loss a penalty on the size of the parameters, which discourages overly complex solutions.
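To make "loss plus penalty" concrete, here is a minimal sketch assuming a linear model with no intercept and mean squared error as the data loss. The function names and the `strength` parameter are illustrative choices for these notes, not any library's API.

```python
def mse(weights, X, y):
    """Mean squared error of a linear model with no intercept."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(w * x for w, x in zip(weights, row))
        total += (prediction - target) ** 2
    return total / len(y)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def regularized_loss(weights, X, y, strength, penalty):
    """Data loss plus the penalty scaled by a strength coefficient."""
    return mse(weights, X, y) + strength * penalty(weights)
```

With `strength` set to zero this reduces to the plain data loss, so any gain in generalization comes from the penalty term alone.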

L2 regularization

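L2, also called ridge, adds the sum of squared weights, scaled by a strength coefficient, to the loss. It shrinks all weights smoothly toward zero but rarely makes any exactly zero. It reduces variance, especially when features are correlated, and is the default penalty in many libraries. In neural network training it is often called weight decay; the two coincide for plain gradient descent but differ for adaptive optimizers such as Adam.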
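The smooth shrinkage can be seen in a one-weight toy problem. The sketch below assumes the objective 0.5*(w - z)**2 + 0.5*strength*w**2, where z is the value the weight would take without any penalty. Setting the derivative to zero gives w = z / (1 + strength). The name `ridge_shrink` is made up for this illustration.

```python
def ridge_shrink(z, strength):
    """Minimizer of 0.5*(w - z)**2 + 0.5*strength*w**2 over w."""
    return z / (1.0 + strength)


unpenalized = [3.0, -1.5, 0.2, 0.0]
shrunk = [ridge_shrink(z, strength=1.0) for z in unpenalized]
# Every weight is scaled by the same factor 1 / (1 + strength),
# so a weight that starts non-zero stays non-zero.
```

Each weight moves toward zero in proportion to its size, which is why large weights are penalized most and small ones are almost left alone.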

L1 regularization

L1, also called lasso, adds the sum of the absolute values of the weights. Because the penalty has a sharp corner at zero, the optimum often lands with some weights exactly zero, producing a sparse model that effectively performs feature selection.
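The exact zeros show up in a one-weight toy problem. The sketch below assumes the objective 0.5*(w - z)**2 + strength*abs(w), where z is the value the weight would take without any penalty. Its minimizer is the soft-thresholding rule: move z toward zero by `strength`, and stop at zero if it would cross. The function name is an illustrative choice.

```python
def soft_threshold(z, strength):
    """Minimizer of 0.5*(w - z)**2 + strength*abs(w) over w."""
    if z > strength:
        return z - strength
    if z < -strength:
        return z + strength
    return 0.0  # any z within [-strength, strength] is set exactly to zero


unpenalized = [3.0, -1.5, 0.2, 0.0]
sparse = [soft_threshold(z, strength=0.5) for z in unpenalized]
# sparse == [2.5, -1.0, 0.0, 0.0]
```

Unlike a proportional shrink, the small weight 0.2 is removed outright, while the large weights each lose only a fixed amount.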

Choosing

Use L2 when you expect many small contributions from all features.
Use L1 when you suspect only a few features truly matter.
Use elastic net, which blends both penalties, when you want sparsity but features are correlated.
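A blended penalty can be sketched as a weighted mix of the two sums. The mixing parameter is called `l1_ratio` here as an assumption for illustration. Libraries use their own parametrizations and scaling constants, so treat this as the idea rather than any particular API.

```python
def elastic_net_penalty(weights, strength, l1_ratio):
    """Weighted mix of the L1 and L2 penalties.

    l1_ratio = 1.0 gives pure L1, l1_ratio = 0.0 gives pure L2.
    """
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return strength * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)


weights = [1.0, -2.0]
pure_l1 = elastic_net_penalty(weights, strength=1.0, l1_ratio=1.0)  # 3.0
pure_l2 = elastic_net_penalty(weights, strength=1.0, l1_ratio=0.0)  # 5.0
blend = elastic_net_penalty(weights, strength=1.0, l1_ratio=0.5)    # 4.0
```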

The strength knob

A coefficient, often written λ (called alpha in some libraries), controls how strong the penalty is. If it is too strong the model underfits; if it is too weak the model overfits. It is usually tuned with cross-validation.
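Tuning the strength by cross-validation can be sketched with a one-feature ridge model, which has a simple closed form. Everything below is a toy setup: the data are made up, the folds are interleaved rather than shuffled, and the function names are illustrative.

```python
def fit_ridge_1d(xs, ys, strength):
    """Closed-form minimizer of sum((y - w*x)**2) + strength * w**2."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + strength
    return numerator / denominator


def cv_error(xs, ys, strength, k):
    """Mean squared validation error averaged over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        valid = [i for i in range(len(xs)) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], strength)
        squared = [(ys[i] - w * xs[i]) ** 2 for i in valid]
        fold_errors.append(sum(squared) / len(valid))
    return sum(fold_errors) / k


def select_strength(xs, ys, candidates, k=3):
    """Return the candidate strength with the lowest cross-validated error."""
    return min(candidates, key=lambda s: cv_error(xs, ys, s, k))


# Made-up data lying roughly on y = 2x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.3, 5.9]
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
best = select_strength(xs, ys, candidates)
```

Candidates are usually spaced on a logarithmic grid, as here, because the useful range of the strength often spans several orders of magnitude.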

Key idea

L2 shrinks weights smoothly while L1 drives some to exactly zero, and both trade a little training fit for better generalization.