Machine Learning

L1 and L2 Regularization

Penalizing big weights to fight overfitting and encourage sparsity.

Why regularize

A model that fits training noise perfectly often generalizes poorly. Regularization adds a penalty on the size of the parameters to the loss, discouraging overly complex solutions.
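As a minimal sketch of what "adding a penalty to the loss" means, the snippet below assumes a linear model without a bias term and mean squared error as the data loss. The function names, the toy data, and the symbol lam for the penalty strength are illustrative choices, not part of the lesson.

```python
def mse(weights, xs, ys):
    """Mean squared error of a linear model without a bias term."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(xs)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)


def regularized_loss(weights, xs, ys, lam, penalty):
    """Data loss plus lam times a penalty on the weights."""
    return mse(weights, xs, ys) + lam * penalty(weights)


# Hypothetical toy data: y = 2 * x0, the second feature is irrelevant.
xs = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
ys = [2.0, 4.0, 6.0]
weights = [2.0, 0.0]  # fits the toy data exactly, so the data loss is zero

print(regularized_loss(weights, xs, ys, lam=0.0, penalty=l2_penalty))
print(regularized_loss(weights, xs, ys, lam=0.1, penalty=l2_penalty))
print(regularized_loss(weights, xs, ys, lam=0.1, penalty=l1_penalty))
```

With lam set to zero only the data loss remains. Any positive lam makes larger weights more expensive, even when they fit the training data perfectly.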

L2 regularization

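L2, also called ridge, adds the sum of squared weights to the loss. It shrinks all weights smoothly toward zero but rarely makes any exactly zero. It is effective at controlling variance and is the default penalty in many libraries. In neural network training it is often called weight decay, because under plain gradient descent the penalty multiplies every weight by a constant factor slightly below one at each step.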
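The link to weight decay can be sketched with a single plain gradient descent step. The snippet assumes the loss is the data loss plus lam times the sum of squared weights. The names lr, lam, and both function names are hypothetical, and the equivalence shown here is specific to plain gradient descent.

```python
def sgd_step_l2(weights, data_grad, lr, lam):
    """One gradient step on data_loss + lam * sum(w**2).

    The penalty contributes a gradient of 2 * lam * w for each weight.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, data_grad)]


def sgd_step_decay(weights, data_grad, lr, lam):
    """The same update written as: decay the weights, then step on the data gradient."""
    decay = 1.0 - 2.0 * lr * lam
    return [decay * w - lr * g for w, g in zip(weights, data_grad)]


weights = [4.0, -0.5, 0.01]
zero_grad = [0.0, 0.0, 0.0]  # zero data gradient isolates the effect of the penalty

for _ in range(50):
    weights = sgd_step_l2(weights, zero_grad, lr=0.1, lam=0.5)

print(weights)  # every weight has shrunk a lot, but none is exactly zero
```

Each step scales a weight in proportion to its current size, so large weights lose the most in absolute terms and small weights approach zero without reaching it.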

L1 regularization

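L1, also called lasso, adds the sum of the absolute values of the weights. Because the penalty pulls toward zero with the same strength no matter how small a weight already is, some weights are driven exactly to zero, producing a sparse model that effectively performs feature selection.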
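One way to see where the exact zeros come from is the soft-thresholding operator, which is the update many L1 solvers apply to each weight. This is a sketch under that assumption. The threshold t stands in for the penalty strength times a step size, and the example weights are made up.

```python
def soft_threshold(w, t):
    """Shrink w toward zero by t, returning exactly 0.0 when |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


weights = [4.0, -0.5, 0.01]
shrunk = [soft_threshold(w, 0.5) for w in weights]
print(shrunk)  # [3.5, 0.0, 0.0]
```

Every weight moves by the same amount t, so any weight whose magnitude is at most t lands exactly on zero instead of merely getting smaller.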

Choosing

  • Use L2 when you expect many small contributions from all features
  • Use L1 when you suspect only a few features truly matter
  • Elastic net blends both penalties
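The blend in elastic net can be sketched as a weighted mix of the two penalties. The parameter names lam and l1_ratio are assumptions for illustration. Libraries use their own names and scaling conventions, so treat this as the idea and not as any particular API.

```python
def elastic_net_penalty(weights, lam, l1_ratio):
    """Blend of L1 and L2 penalties.

    l1_ratio = 1.0 gives pure L1, l1_ratio = 0.0 gives pure L2.
    """
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must be between 0 and 1")
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)


weights = [3.0, -4.0]
print(elastic_net_penalty(weights, lam=1.0, l1_ratio=1.0))  # pure L1: 7.0
print(elastic_net_penalty(weights, lam=1.0, l1_ratio=0.0))  # pure L2: 25.0
print(elastic_net_penalty(weights, lam=1.0, l1_ratio=0.5))  # even mix: 16.0
```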

The strength knob

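A coefficient, often written λ, controls how strong the penalty is. If it is too large the model underfits; if it is too small the model overfits. The coefficient is usually tuned with cross-validation, by picking the value that gives the lowest error on held-out data.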
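A rough sketch of tuning the coefficient with cross-validation follows. It assumes a one-feature ridge model without a bias term, which has a simple closed-form solution. The data, the candidate values, and the function names are all hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)**2) + lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=4):
    """Average held-out squared error over k folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point, offset by fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, lam)
        total += sum((w * xs[i] - ys[i]) ** 2 for i in held_out) / len(held_out)
    return total / k


# Hypothetical toy data, roughly y = 2x with some noise.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.2, 1.7, 3.4, 3.8, 5.3, 5.9, 7.2, 7.8]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

The largest candidate shrinks the weight so far that held-out error rises, which is the underfitting end of the trade-off. Candidates are usually spaced on a logarithmic grid like this because the useful range of the coefficient can span several orders of magnitude.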

Key idea

L2 shrinks weights smoothly while L1 drives some to exactly zero, and both trade a little training fit for better generalization.

Check yourself

1. Which regularizer tends to produce sparse weights?

2. What does the regularization coefficient control?

3. What happens if the penalty is far too strong?