Normalization and Standardization
Raw features often live on wildly different scales. Income might range in the thousands while age ranges in the tens. Many algorithms struggle when scales differ, so we rescale features first.
Two common transforms do this:
- Normalization, often min max scaling, squeezes each feature into a fixed range such as zero to one
- Standardization shifts each feature to have a mean of zero and a standard deviation of one
Standardization is computed by subtracting the mean and dividing by the standard deviation of the feature. Normalization is computed by subtracting the minimum and dividing by the range.
Why bother? Distance based methods and gradient descent both treat large numbers as more important by accident. A feature with a big raw scale can dominate the loss and warp the optimization, making training slow or unstable. Putting features on a comparable scale lets each contribute fairly and helps the optimizer converge.
One crucial rule: compute the scaling statistics on the training set only, then apply the same transform to validation and test data. Fitting the scaler on all data leaks information and inflates your scores.
Key idea
Rescaling features through normalization or standardization puts them on a common scale; fit the statistics on training data only to avoid leakage.