Handling Imbalanced Data
When one class is far rarer than another, such as fraud among normal transactions, models tend to ignore the minority class and still report high accuracy. Imbalance demands special handling.
Why accuracy misleads
If 99 percent of rows are negative, a model that always predicts negative scores 99 percent accuracy while catching zero positives. Use metrics that focus on the rare class.
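- Precision is the share of predicted positives that are truly positive, and recall is the share of actual positives the model finds.
- F1 score is the harmonic mean of the two, and the precision-recall curve, which shows the trade-off across decision thresholds, suits heavy imbalance.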
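The gap between accuracy and the minority-focused metrics can be checked by hand. The sketch below is illustrative only: the helper name binary_metrics, the made-up label lists, and the convention of reporting 0.0 for an undefined ratio are all assumptions, not part of any particular library.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: report 0.0 when a ratio would be 0/0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up data: 10 positives among 1000 rows (1 percent positive).
y_true = [1] * 10 + [0] * 990

# Model A always predicts negative.
always_negative = [0] * 1000
metrics_a = binary_metrics(y_true, always_negative)

# Hypothetical model B finds 8 of the 10 positives and raises 12 false alarms.
model_b = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978
metrics_b = binary_metrics(y_true, model_b)

print(metrics_a)  # accuracy 0.99, but precision, recall and F1 are all 0.0
print(metrics_b)  # accuracy 0.986, precision 0.4, recall 0.8
```

Model B has slightly lower accuracy than the always-negative model, yet it is the only one of the two that catches any positives, which is exactly what precision, recall, and F1 reveal.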
Rebalancing techniques
- Oversampling the minority class, for example by duplicating existing rows or synthesizing new ones with SMOTE.
- Undersampling the majority class to even the counts, at the cost of discarding data.
- Class weights that make minority errors cost more during training.
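The three techniques above can be sketched with the standard library alone. This is a minimal illustration, not a substitute for a tested library: the function names are hypothetical, oversampling here is plain duplication rather than SMOTE, and the weight formula n_samples / (n_classes * class_count) is one common "balanced" heuristic assumed for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        kept = rng.sample(pool, target)
        out_rows.extend(kept)
        out_labels.extend([cls] * len(kept))
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Made-up training data: 5 positives and 95 negatives.
train_rows = list(range(100))
train_labels = [1] * 5 + [0] * 95

_, over_labels = random_oversample(train_rows, train_labels)
_, under_labels = random_undersample(train_rows, train_labels)
weights = balanced_class_weights(train_labels)

print(Counter(over_labels))   # both classes now have 95 rows
print(Counter(under_labels))  # both classes now have 5 rows
print(weights[1])             # 10.0, so a minority error costs far more
```

Class weights leave the data untouched and change only the loss, which makes them a convenient first thing to try when the training library supports them.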
A critical rule
Apply resampling only to the training fold, never to validation or test data, and never before splitting. Resampling the whole dataset leaks information and gives an unrealistically optimistic score.
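The ordering matters because duplicated minority rows can land on both sides of a split made afterwards. The sketch below contrasts the two orders on made-up row ids; the helper names, the 80/20 split, and the use of simple duplication for oversampling are assumptions chosen for illustration.

```python
import random


def train_test_split_ids(ids, test_fraction=0.2, seed=0):
    """Shuffle the ids and cut off the first test_fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample_minority(ids, labels_by_id, seed=0):
    """Append random duplicates of minority ids until both classes match."""
    rng = random.Random(seed)
    pos = [i for i in ids if labels_by_id[i] == 1]
    neg = [i for i in ids if labels_by_id[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(ids)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(ids) + extra


# Made-up data: 200 rows, ids 0..19 are positive, the rest negative.
ids = list(range(200))
labels_by_id = {i: 1 if i < 20 else 0 for i in ids}

# Correct order: split first, then resample the training part only.
train_ids, test_ids = train_test_split_ids(ids)
train_balanced = oversample_minority(train_ids, labels_by_id)
shared_ok = set(train_balanced) & set(test_ids)

# Leaky order: resample everything, then split the inflated dataset.
inflated = oversample_minority(ids, labels_by_id)
leaky_train, leaky_test = train_test_split_ids(inflated)
shared_leaky = set(leaky_train) & set(leaky_test)

print(len(shared_ok))     # 0: no test row was seen during training
print(len(shared_leaky))  # count of row ids present in both train and test
```

In the leaky order the model can be scored on copies of rows it was trained on, which is one concrete way the optimistic score arises. With cross-validation the same rule applies per fold: resample inside each training fold, after the fold has been cut.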
Key idea
Imbalanced data needs minority-focused metrics and rebalancing through resampling or class weights, with resampling applied only to the training fold to avoid leakage.