Why penalize coefficients
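When there are many features, or when features are strongly correlated, ordinary least squares can produce very large coefficients that swing sharply with small changes in the data. Regularization adds a penalty on coefficient size to the loss, usually trading a small increase in bias for a larger reduction in variance.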
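To make the instability concrete, here is a minimal sketch that fits two nearly identical features with and without an L2 penalty. The function name fit_two_feature_ridge, the toy data, and the choice to omit an intercept are all assumptions made for illustration; they are not from the text above.

```python
def fit_two_feature_ridge(x1, x2, y, lam=0.0):
    """Solve (X^T X + lam * I) w = X^T y for two features, no intercept.

    With lam = 0 this is ordinary least squares. Hypothetical helper
    written for this illustration only.
    """
    # Entries of the 2x2 matrix X^T X, with lam added on the diagonal.
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2) + lam
    # Entries of the vector X^T y.
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    # Cramer's rule for the 2x2 system.
    det = a * c - b * b
    return ((c * p - b * q) / det, (a * q - b * p) / det)


# Toy data: x2 is almost a copy of x1, so the two are highly correlated.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 1.99, 3.01, 3.99]
y = [1.1, 1.9, 3.2, 3.8]

ols = fit_two_feature_ridge(x1, x2, y, lam=0.0)
ridge = fit_two_feature_ridge(x1, x2, y, lam=1.0)
print("least squares weights:", ols)
print("ridge weights (lam=1):", ridge)
```

On this toy data the unpenalized weights come out large with opposite signs (roughly -14 and 15), while the penalized fit gives two moderate weights of similar size, each a little under 0.5.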
Ridge versus lasso
- Ridge adds the sum of squared weights, an L2 penalty. It shrinks all coefficients smoothly toward zero but rarely sets any to exactly zero.
- Lasso adds the sum of absolute weights, an L1 penalty. Its constraint region has corners on the coordinate axes, which tends to push some coefficients to exactly zero, so lasso also performs feature selection.
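The difference between the two penalties is easiest to see in a simplified setting. The sketch below assumes the feature columns are orthonormal, the ridge loss is 0.5 * squared error + 0.5 * lam * (sum of squared weights), and the lasso loss is 0.5 * squared error + lam * (sum of absolute weights). Under those assumptions each penalized coefficient is a simple function of the least squares coefficient b: ridge rescales it, and lasso soft-thresholds it. The function names and example numbers are illustrative only.

```python
def ridge_shrink(b, lam):
    """Ridge solution for one coefficient under an orthonormal design."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Lasso solution (soft-thresholding) under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # coefficients within [-lam, lam] are set exactly to zero


# Hypothetical least squares coefficients and penalty strength.
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]
lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
print("ridge:", ridge_coefs)
print("lasso:", lasso_coefs)
```

Ridge scales every coefficient by the same factor and leaves all of them nonzero. Lasso subtracts a fixed amount from each magnitude, so the two small coefficients land at exactly zero.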
Choosing between them
- Use ridge when you believe most features matter a little and want stability.
- Use lasso when you expect only a few features truly matter and want a sparse model.
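- Elastic net blends both penalties. It keeps some sparsity while tending to keep or drop groups of correlated features together, where lasso alone often picks one feature from the group somewhat arbitrarily.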
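As a rough sketch, the three penalties can be written as small functions. The parameter names lam (overall strength) and mix (share given to the L1 term) are assumptions chosen for this example; libraries use their own names and scaling conventions.

```python
def l2_penalty(weights):
    """Ridge penalty: sum of squared weights."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Lasso penalty: sum of absolute weights."""
    return sum(abs(w) for w in weights)


def elastic_net_penalty(weights, lam, mix):
    """Weighted blend of the L1 and L2 penalties.

    mix = 1.0 gives a pure lasso penalty, mix = 0.0 a pure ridge penalty.
    This parameterization is one illustrative choice among several.
    """
    return lam * (mix * l1_penalty(weights) + (1.0 - mix) * l2_penalty(weights))


weights = [2.0, -1.0, 0.0]
print("L1:", l1_penalty(weights))
print("L2:", l2_penalty(weights))
print("elastic net:", elastic_net_penalty(weights, lam=0.1, mix=0.5))
```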
The penalty strength is a hyperparameter, typically tuned by cross-validation. Standardize features first: the penalty acts on coefficient magnitudes, which depend on each feature's units, so unscaled features are penalized unevenly.
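The two steps can be sketched together in plain Python. Everything here is an assumption made for illustration: the helper names, the synthetic data, the candidate grid, the interleaved fold scheme, and the restriction to a single feature with no intercept. For brevity the feature is standardized once up front; a stricter pipeline would compute the mean and scale from each training fold only.

```python
import random


def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def fit_ridge_1d(x, y, lam):
    """One-feature ridge without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def cv_error(x, y, lam, k=5):
    """Mean squared validation error over k interleaved folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point is validation
        train_x = [x[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, lam)
        total += sum((y[i] - w * x[i]) ** 2 for i in held_out)
    return total / n


# Synthetic data: one feature on a large raw scale, then standardized.
rng = random.Random(0)
raw_x = [rng.uniform(0.0, 1000.0) for _ in range(40)]
x = standardize(raw_x)
y = [2.0 * v + rng.gauss(0.0, 1.0) for v in x]

# Try each candidate strength and keep the one with the lowest CV error.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(x, y, lam))
print("selected penalty strength:", best_lam)
```

A grid spaced on a log scale is a common starting point because useful penalty strengths can span several orders of magnitude.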
Key idea
Ridge shrinks all coefficients smoothly with an L2 penalty for stability, while lasso uses an L1 penalty to drive some coefficients to exactly zero for feature selection. Elastic net combines both.