Handling Imbalanced Data

Address rare-class problems with resampling, class weights, and the right metrics.

When one class is far rarer than another, such as fraud among normal transactions, a model can report high accuracy while largely ignoring the minority class. Imbalanced data therefore needs metrics and training procedures chosen with the rare class in mind.

Why accuracy misleads

If 99 percent of rows are negative, a model that always predicts negative scores 99 percent accuracy while catching zero positives. Use metrics that focus on the rare class instead.

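Precision is the share of predicted positives that are truly positive; recall is the share of actual positives that the model finds.
F1 score is the harmonic mean of the two, and the precision-recall curve suits heavy imbalance because neither quantity counts true negatives.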
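The gap between accuracy and the rare-class metrics can be checked by hand. The sketch below uses only the Python standard library; the labels are made up for illustration, and the helper names accuracy and precision_recall_f1 are hypothetical rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 10 positives among 1,000 rows (1 percent positive).
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
always_negative = [0] * 1000

baseline_accuracy = accuracy(y_true, always_negative)
baseline_prf = precision_recall_f1(y_true, always_negative)

print(baseline_accuracy)  # 0.99
print(baseline_prf)       # (0.0, 0.0, 0.0)
```

The always-negative baseline looks excellent on accuracy and scores zero on every metric tied to the positive class, which is why the rare-class metrics are the ones to watch.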

Rebalancing techniques

Oversampling the minority class, for example by duplicating existing rows or synthesizing new ones with SMOTE.
Undersampling the majority class to even the counts, at the cost of discarding data.
Class weights that make errors on the minority class cost more during training.
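The three techniques above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the oversampler only duplicates rows (it does not implement SMOTE), the weight formula is a common inverse-frequency heuristic, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Toy data: 100 rows identified by integers, 5 of them positive.
rows = list(range(100))
labels = [1] * 5 + [0] * 95

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
weights = balanced_class_weights(labels)

print(Counter(over_labels))   # 95 rows per class
print(Counter(under_labels))  # 5 rows per class
print(weights[1])             # 10.0
```

Oversampling keeps every original row but repeats minority rows, undersampling throws majority rows away, and class weights leave the data untouched and shift the cost of errors instead.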

A critical rule

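Split the data first, then apply resampling only to the training fold, never to validation or test data. Resampling the whole dataset before splitting leaks information, because copies or synthetic neighbors of the same minority row can land on both sides of the split, and the resulting score is unrealistically optimistic.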
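The ordering in this rule can be sketched as split first, resample second. The example below is an assumption-laden illustration using only the standard library: the stratified split, the helper names stratified_split and oversample_minority, and the toy labels are all hypothetical choices made for the sketch.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train and test, keeping class ratios in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


def oversample_minority(indices, labels, seed=0):
    """Duplicate indices of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    out = list(indices)
    for members in by_class.values():
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out


# Toy labels: 20 positives among 200 rows.
labels = [1] * 20 + [0] * 180

# Step 1: split BEFORE any resampling.
train_idx, test_idx = stratified_split(labels)

# Step 2: rebalance the training indices only; the test indices stay untouched.
balanced_train_idx = oversample_minority(train_idx, labels)

print(Counter(labels[i] for i in balanced_train_idx))  # 144 rows per class
print(Counter(labels[i] for i in test_idx))            # 36 negatives, 4 positives
```

Because duplicates are drawn only from training indices, no copy of a test row can appear in the training data, and the test set keeps the original class ratio that the model will face in practice. Under cross-validation the same two steps would be repeated inside each fold.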

Key idea

Imbalanced data needs minority-focused metrics and rebalancing through resampling or class weights, with resampling applied only to the training fold to avoid leakage.
