L1 versus L2 Regularization Effects
Regularization adds a penalty on weight size to the loss, discouraging overly complex models. Both L1 and L2 penalties shrink weights, but they produce very different solutions.
The two penalties
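- L2 adds the sum of squared weights. It is often called weight decay, because under plain gradient descent it shrinks every weight by a constant factor at each step.
- L1 adds the sum of the absolute values of the weights.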
- Both trade a little training fit for better generalization.
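As a minimal sketch of the two penalties, the functions below compute each one for a plain list of weights. The names `l2_penalty`, `l1_penalty`, `regularized_loss`, and the strength parameter `lam` are illustrative assumptions, not part of any particular library.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to an already-computed data loss."""
    if kind == "l2":
        return data_loss + l2_penalty(weights, lam)
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    raise ValueError("kind must be 'l1' or 'l2'")
```

A larger `lam` buys more shrinkage at the cost of training fit. Note also that squaring makes L2 punish large weights far more heavily than small ones, while L1 charges every unit of weight the same.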
Different geometry
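L2's pull on each weight is proportional to the weight itself. It pushes every weight smoothly toward zero, but the pull fades as the weight shrinks, so it rarely makes any weight exactly zero and instead spreads importance across many small weights. L1's pull has the same strength regardless of weight size, so it keeps pushing small weights until they reach exactly zero. Geometrically, the L2 constraint region is a ball, while the L1 region is a diamond whose corners lie on the axes, where some weights are exactly zero. That makes L1 a feature selector that produces sparse models you can inspect.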
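The difference in pull can be sketched by applying only the penalty update, with no data loss, to a few weights. This is an illustration under stated assumptions: the L2 update is a plain gradient step on `lam * w**2`, and the L1 update uses soft-thresholding (the proximal step for `lam * |w|`), one common way to realize exact zeros that the text above does not itself prescribe. The function names, learning rate `lr`, and starting weights are hypothetical.

```python
def l2_step(w, lr, lam):
    # Gradient of lam * w**2 is 2 * lam * w, so the pull is proportional to w.
    return w - lr * 2.0 * lam * w


def l1_step(w, lr, lam):
    # Soft-thresholding: move w toward zero by a fixed amount lr * lam,
    # and clamp to exactly zero once |w| is within that amount.
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


def apply_steps(step, weights, lr, lam, n_steps):
    # Apply the same penalty-only update repeatedly to every weight.
    for _ in range(n_steps):
        weights = [step(w, lr, lam) for w in weights]
    return weights


weights = [2.0, 0.5, -0.05]
l2_result = apply_steps(l2_step, weights, lr=0.1, lam=0.5, n_steps=20)
l1_result = apply_steps(l1_step, weights, lr=0.1, lam=0.5, n_steps=20)
```

With these settings each L2 step multiplies every weight by 0.9, so all three weights get smaller but none reaches zero. Each L1 step subtracts a fixed 0.05 in magnitude, so the two smaller weights land on exactly zero while the largest is still well away from it.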
When to use which
Reach for L2 when you want smooth, stable shrinkage and believe most features matter a little. Reach for L1 when you suspect many features are useless and want the model to ignore them outright. The elastic net blends both penalties to get L1's sparsity with L2's stability, which helps when features are correlated.
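One way to write the elastic net blend is as a weighted mix of the two penalties. The sketch below assumes a single strength `lam` and a mixing weight `l1_ratio` between 0 and 1. Both names, and this exact parameterization, are illustrative, and libraries differ in how they scale the two terms.

```python
def elastic_net_penalty(weights, lam, l1_ratio):
    """Blend of L1 and L2 penalties.

    l1_ratio = 1.0 gives pure L1, l1_ratio = 0.0 gives pure L2.
    """
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must be between 0 and 1")
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return lam * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)
```

Moving `l1_ratio` toward 1 favors sparsity, and moving it toward 0 favors smooth shrinkage.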
Key idea
L2 shrinks weights smoothly toward zero while L1 drives many to exactly zero, giving sparse, selectable models.