The problem
When one class hugely outnumbers another, a model can score high accuracy by ignoring the rare class entirely. Accuracy then hides failure on the thing you usually care about: detecting the rare event.
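To make the effect concrete, here is a minimal sketch using made-up labels (990 negatives, 10 positives) and a hypothetical "model" that always predicts the majority class. The numbers are illustrative, not taken from any real dataset.

```python
# Hypothetical toy labels: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: fraction of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.99 recall=0.00
```

The classifier looks excellent by accuracy yet never finds a single rare event.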
Resampling
- Oversampling duplicates minority examples or synthesizes new ones, for example by interpolating between neighboring minority points, as SMOTE does.
- Undersampling drops majority examples to balance counts, risking lost information.
- Split first, then resample only the training split, so duplicated or synthetic examples never leak into the evaluation data.
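The points above can be sketched with plain random oversampling. This is a simplified illustration, not a library API: the function name `split_then_oversample`, the `(features, label)` row layout, and the 0/1 label convention are all assumptions made for the example. The key detail is that the split happens before any duplication.

```python
import random

def split_then_oversample(rows, test_fraction=0.2, seed=0):
    """Split first, then oversample the minority class in the training split only.

    rows: list of (features, label) pairs; label 1 is assumed to be the minority.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # Hold out the test split before touching class counts.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    minority = [r for r in train if r[1] == 1]
    majority = [r for r in train if r[1] == 0]
    if not minority:
        return train, test  # nothing to duplicate

    # Draw minority rows with replacement until both classes have equal counts.
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return train + extra, test

# Hypothetical toy data: 950 majority rows and 50 minority rows.
rows = [((i,), 0) for i in range(950)] + [((i,), 1) for i in range(950, 1000)]
train_balanced, test = split_then_oversample(rows)
```

Because the test split is carved out first, it keeps its natural class ratio and shares no rows with the oversampled training data.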
Reweighting
- Give the minority class a larger loss weight so its mistakes cost more.
- Many algorithms accept class weights directly, so no resampling is needed.
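As a rough sketch of reweighting, the code below computes inverse-frequency class weights and applies them to a binary log loss. The helper names and toy numbers are assumptions for illustration. The weight formula `n_samples / (n_classes * class_count)` is one common heuristic (scikit-learn's `class_weight="balanced"` option uses the same formula); other weighting schemes are equally valid.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def weighted_log_loss(y_true, p_pos, weights, eps=1e-12):
    """Binary log loss where each example is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        w = weights[y]
        total += -w * (math.log(p) if y == 1 else math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Hypothetical toy labels: 90 negatives, 10 positives.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)  # minority weight 5.0, majority about 0.56

# A model that nearly ignores the positive class (predicts p=0.01 everywhere).
p_ignore = [0.01] * len(y)
plain_loss = weighted_log_loss(y, p_ignore, {0: 1.0, 1: 1.0})
reweighted_loss = weighted_log_loss(y, p_ignore, weights)
```

With the weights applied, ignoring the minority class is penalized far more heavily than under the unweighted loss, which is the pressure the optimizer needs.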
Metrics and thresholds
- Track precision, recall, and the area under the precision-recall curve, not raw accuracy.
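- Tune the decision threshold rather than defaulting to 0.5; lowering it flags more examples as the rare class, usually raising recall at some cost in precision.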
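A minimal sketch of threshold tuning follows, assuming the model outputs a score per example and that the labels and scores come from a validation split. The function names, the toy scores, and the "highest threshold that reaches a target recall" selection rule are all illustrative assumptions; other criteria, such as maximizing F1, work the same way.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(y_true, scores, min_recall):
    """Return (threshold, precision, recall) for the highest threshold meeting min_recall."""
    # Recall can only grow as the threshold drops, so scan from high to low.
    for t in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(y_true, scores, t)
        if recall >= min_recall:
            return t, precision, recall
    return None

# Hypothetical validation scores: 3 positives among 10 examples.
y_val = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
s_val = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.90]

default_pr = precision_recall_at(y_val, s_val, 0.5)   # precision 2/3, recall 2/3
tuned = pick_threshold(y_val, s_val, min_recall=1.0)  # (0.35, 0.6, 1.0)
```

On this toy data, dropping the threshold from 0.5 to 0.35 recovers the missed positive while precision falls from about 0.67 to 0.60.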
Key idea
Class imbalance lets a model ignore the rare class while still looking accurate. Resampling, class reweighting, and threshold tuning counteract this, and the result should be judged by precision and recall rather than accuracy.