← Lessons

quiz vs the machine

Silver1100

Machine Learning

The Stratified Sampling

How sampling within groups preserves rare classes and stabilizes evaluation.

4 min read · intro · beat Silver to climb

The problem with pure randomness

If you draw a small random sample from imbalanced data, rare groups can appear too few times or vanish entirely by chance. Stratified sampling fixes this by splitting the data into groups, called strata, and sampling within each one.

How it works

  • Define strata by a meaningful variable such as class label, region, or user segment.
  • Sample from each stratum so its representation matches a target, often its true proportion.
  • Combine the per stratum samples into the final set.

Why it helps

  • It guarantees that every group is present, so rare classes are not lost.
  • It reduces variance of estimates, since each group is measured rather than left to chance.
  • For evaluation, it produces stable metrics that do not swing based on which rare rows happened to be drawn.

Where it shows up

  • Building train test splits so both sides have the same class balance.
  • Cross validation folds that each preserve the label distribution.
  • Survey style estimates where each segment must be represented.

Key idea

Stratified sampling draws within meaningful groups so every stratum is represented, preserving rare classes and producing lower variance, stable estimates.

Check yourself

Answer to earn rating on the learn ladder.

1. What does stratified sampling guarantee?

2. Why use stratified train test splits?