Machine Learning

Handling Class Imbalance

Training fair models when one class vastly outnumbers another.

The problem

In many real tasks, one class is rare. Fraud, disease, and defects might appear in well under one percent of examples. A model can score high accuracy by always predicting the majority class while completely failing on the cases that matter.

Why accuracy lies

With a 99-to-1 split, predicting the majority class every time gives 99 percent accuracy and zero useful detection. Look instead at precision, recall, and the F1 score for the minority class.
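The arithmetic above can be checked with a small sketch. The helper name `minority_metrics` and the toy labels (10 positives out of 1,000) are assumptions made for illustration, not part of the lesson; the formulas are the standard definitions of accuracy, precision, recall, and F1.

```python
def minority_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall, and F1 for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical 99-to-1 data: 10 rare cases among 1,000 examples.
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts the majority class.
y_pred = [0] * 1000

metrics = minority_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out at 0.99 while recall and F1 for the rare class are both zero, which is the gap the paragraph describes.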

Data-level fixes

  • Oversampling repeats minority examples or synthesizes new ones, as SMOTE does
  • Undersampling drops some majority examples
  • Class-balanced sampling draws batches so that classes appear more evenly
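As a rough illustration of the first two options, here is a minimal sketch of random oversampling and random undersampling using only the standard library. The function names and the toy data are assumptions for this example. This is plain duplication and dropping, not SMOTE, which would synthesize new points instead of repeating existing ones.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical training split: 95 majority rows, 5 minority rows.
X_train = [[i] for i in range(100)]
y_train = [0] * 95 + [1] * 5

X_over, y_over = random_oversample(X_train, y_train)
X_under, y_under = random_undersample(X_train, y_train)
print(Counter(y_over), Counter(y_under))
```

Resampling like this is normally applied to the training split only, so that validation and test metrics still reflect the real class ratio.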

Loss-level fixes

  • Class weights make mistakes on the minority class cost more in the loss
  • Focal loss down-weights easy, well-classified examples so the model focuses on hard cases, which are often minority examples
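Both ideas can be sketched as per-example loss functions for binary classification. The function names are hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are commonly cited focal loss settings used here as assumptions, not values from this lesson. The focal loss form is -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class.

```python
import math

EPS = 1e-12  # keeps log() away from zero


def weighted_bce(p, y, pos_weight=1.0):
    """Binary cross-entropy where errors on the positive class are scaled by pos_weight."""
    p = min(max(p, EPS), 1 - EPS)
    if y == 1:
        return -pos_weight * math.log(p)
    return -math.log(1 - p)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, EPS), 1 - EPS)
    p_t = p if y == 1 else 1 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha  # class balancing term
    # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions.
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# Two positive examples: one the model already gets right, one it gets badly wrong.
easy_p, hard_p = 0.95, 0.10

ce_ratio = weighted_bce(hard_p, 1) / weighted_bce(easy_p, 1)
focal_ratio = focal_loss(hard_p, 1) / focal_loss(easy_p, 1)
print(f"hard/easy loss ratio: cross-entropy {ce_ratio:.1f}, focal {focal_ratio:.1f}")
```

The hard example dominates far more under focal loss than under plain cross-entropy, which is the "focus on hard cases" effect. With `gamma=0`, the focal term disappears and the function reduces to an alpha-weighted cross-entropy.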

Threshold and evaluation

The default decision threshold of 0.5 is rarely optimal under imbalance. Tuning the threshold on a validation set and reporting the precision-recall curve gives a far truer picture than raw accuracy. Always evaluate with metrics that reflect the cost of missing the rare class.
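One simple way to tune the threshold is to try each distinct validation score as a cutoff and keep the one with the best minority-class F1. The sketch below assumes F1 is the target metric; the function names and the tiny validation set are made up for illustration, and a real project would pick the metric that matches its actual costs.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 for scores >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores):
    """Return the candidate threshold (taken from the observed scores) with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


# Hypothetical validation labels and predicted probabilities for the rare class.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
val_scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.35, 0.45]

default_f1 = f1_at_threshold(y_val, val_scores, 0.5)
tuned = best_threshold(y_val, val_scores)
tuned_f1 = f1_at_threshold(y_val, val_scores, tuned)
print(f"F1 at 0.5: {default_f1:.2f}, best threshold: {tuned}, F1 there: {tuned_f1:.2f}")
```

In this toy set no score reaches 0.5, so the default threshold detects nothing, while a lower cutoff recovers both rare cases at the price of one false alarm. The chosen threshold should then be confirmed on a separate test set, since it was fit to the validation data.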

Key idea

Under imbalance, accuracy misleads, so combine resampling or reweighting with threshold tuning and minority-focused metrics.

Check yourself

1. Why is accuracy a poor metric under heavy class imbalance?

2. What does focal loss do?

3. Which pair of metrics best captures minority performance?