Stratified Sampling

Keep class proportions consistent across every data split.

The risk of random splits

When you split data randomly, the class proportions in each split can drift by chance, and the drift gets worse as the split gets smaller. If a rare class makes up 10% of the data, a random validation slice might end up with that class at 5% or 15%, which distorts your metrics.
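To see the drift for yourself, here is a minimal simulation sketch using only the Python standard library. The dataset size, the 10% rare rate, the 10% validation fraction, and the function name are all assumptions chosen for illustration.

```python
import random

def rare_share_in_random_split(n_rows=1000, rare_rate=0.10, val_fraction=0.10, seed=0):
    """Return the rare-class share of one purely random validation slice."""
    rng = random.Random(seed)
    n_rare = round(n_rows * rare_rate)
    labels = [1] * n_rare + [0] * (n_rows - n_rare)  # 1 marks the rare class
    val_size = round(n_rows * val_fraction)
    val = rng.sample(labels, val_size)  # random split, no stratification
    return sum(val) / val_size

# Repeat the random split with different seeds and compare the extremes.
shares = [rare_share_in_random_split(seed=s) for s in range(200)]
print(f"lowest {min(shares):.2f}, highest {max(shares):.2f}")
```

The overall rare rate is fixed at 10%, yet the share inside each validation slice moves from run to run. That spread is what stratification removes.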

What stratification does

Stratified sampling draws each split separately from within every group, called a stratum, so each split's proportions match the full dataset. If the classes are split 80/20 overall, every fold stays roughly 80/20.

  • It is especially important for imbalanced classification.
  • It keeps small classes represented in every fold.
  • It reduces variance between folds caused by uneven label mixes.
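The idea can be sketched in a few lines of plain Python: group the row indices by class, shuffle within each class, then deal them out to the folds one at a time. The function name and the toy 80/20 labels are assumptions for illustration. In practice, libraries such as scikit-learn offer this through StratifiedKFold and the stratify argument of train_test_split.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each row index to a fold so every fold mirrors the label mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out like cards, so per-class counts
        # differ by at most one between folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Toy data: 80 rows of a common class and 20 rows of a rare class.
labels = ["common"] * 80 + ["rare"] * 20
folds = stratified_folds(labels, n_folds=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With this toy data every fold receives 16 common rows and 4 rare rows, so the 80/20 mix holds in each one.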

Beyond labels

You can stratify on things other than the target.

  • Stratify by a region or customer segment to keep each split representative.
  • Stratify by a binned continuous value to balance a numeric target.
  • Combine a few keys when one alone is not enough.
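A rough sketch of how such stratum keys might be built: bin a numeric column by quantiles, then pair it with a categorical column to form a combined key. The helper names, the made-up prices and regions, and the choice of four bins are assumptions, not a prescribed recipe.

```python
import bisect
import statistics
from collections import Counter

def quantile_bins(values, n_bins=4):
    """Map each numeric value to a quantile bin index from 0 to n_bins - 1."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_right(cuts, v) for v in values]

def combine_keys(*columns):
    """Zip several key columns into one tuple key per row."""
    return list(zip(*columns))

prices = list(range(100))  # made-up numeric target
regions = ["north" if i % 2 == 0 else "south" for i in range(100)]  # made-up segment

price_bin = quantile_bins(prices, n_bins=4)
strata = combine_keys(regions, price_bin)

print(Counter(price_bin))                      # rows per price bin
print(len(set(strata)), "distinct strata")     # region x price bin combinations
```

The resulting strata list can be used in place of the class label when stratifying. Each extra key multiplies the number of strata, so it is worth checking that none of them ends up with too few rows to spread across the splits.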

A caution

Stratification fixes label proportions, but it does not prevent leakage between related rows. If the same user appears many times, that user's rows can land on both sides of a split, so group-aware splitting is still needed on top.
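One way to combine the two ideas, sketched under simplifying assumptions: keep all rows of a group together, and stratify the groups themselves by each group's most common label. The function name and the made-up user data are illustrative. This is an approximation that works best when a group's rows mostly share one label. scikit-learn's StratifiedGroupKFold is a library option for the same problem.

```python
import random
from collections import Counter, defaultdict

def group_stratified_split(labels, groups, val_fraction=0.2, seed=0):
    """Split rows so no group straddles train and validation."""
    rng = random.Random(seed)
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)

    # Stratify at the group level, using each group's most common label.
    groups_by_label = defaultdict(list)
    for group, rows in rows_by_group.items():
        majority = Counter(labels[i] for i in rows).most_common(1)[0][0]
        groups_by_label[majority].append(group)

    # Pick validation groups separately within each label stratum.
    val_groups = set()
    for members in groups_by_label.values():
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        val_groups.update(members[:n_val])

    train = [i for i, g in enumerate(groups) if g not in val_groups]
    val = [i for i, g in enumerate(groups) if g in val_groups]
    return train, val

# Made-up data: 50 users with 4 rows each; every fifth user is in the rare class.
groups = [f"user{u}" for u in range(50) for _ in range(4)]
labels = ["rare" if int(g[4:]) % 5 == 0 else "common" for g in groups]

train, val = group_stratified_split(labels, groups)
shared = {groups[i] for i in train} & {groups[i] for i in val}
print(len(train), len(val), "shared users:", len(shared))
```

Because whole users are assigned to one side, no user's rows appear in both train and validation, while the rare class keeps roughly the same share in each.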

Key idea

Stratified sampling preserves class proportions in every split, which matters most for imbalanced data and small classes.

Check yourself

1. What does stratified sampling preserve across splits?

2. When is stratification most important?