The Stratified Sampling

The problem with pure randomness

If you draw a small random sample from imbalanced data, rare groups can appear too few times or vanish entirely by chance. Stratified sampling fixes this by splitting the data into groups, called strata, and sampling within each one.

How it works

Define strata by a meaningful variable such as class label, region, or user segment.
Sample from each stratum so its representation matches a target, often its true proportion.
Combine the per stratum samples into the final set.

Why it helps

It guarantees that every group is present, so rare classes are not lost.
It reduces variance of estimates, since each group is measured rather than left to chance.
For evaluation, it produces stable metrics that do not swing based on which rare rows happened to be drawn.

Where it shows up

Building train test splits so both sides have the same class balance.
Cross validation folds that each preserve the label distribution.
Survey style estimates where each segment must be represented.

Key idea

Stratified sampling draws within meaningful groups so every stratum is represented, preserving rare classes and producing lower variance, stable estimates.

The Stratified Sampling

The problem with pure randomness

How it works

Why it helps

Where it shows up

Key idea

Check yourself