The problem with pure randomness
If you draw a small random sample from imbalanced data, rare groups can appear too few times or vanish entirely by chance. Stratified sampling fixes this by splitting the data into groups, called strata, and sampling within each one.
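To put a rough number on "vanish entirely by chance", here is a small sketch. It assumes independent draws (sampling with replacement, or a population large enough that this is a fair approximation), and the function name is made up for illustration.

```python
def prob_group_missing(group_share: float, sample_size: int) -> float:
    """Chance that a group with the given population share gets zero rows,
    assuming independent draws (with replacement, or a very large population)."""
    return (1.0 - group_share) ** sample_size

# A group that makes up 1% of the data, in a random sample of 100 rows:
print(round(prob_group_missing(0.01, 100), 3))  # 0.366
```

Under that assumption, a 1% group is missing from roughly one in three samples of 100 rows.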
How it works
- Define strata by a meaningful variable such as class label, region, or user segment.
- Sample from each stratum so its representation matches a target, often its true proportion.
- Combine the per-stratum samples into the final set.
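The three steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name, the `key` callback, and the toy data are assumptions, and the allocation rule (proportional, with a floor of one row per stratum) is only one possible choice of target.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, sample_size, seed=0):
    """Draw roughly `sample_size` rows, allocating to each stratum in
    proportion to its size, with at least one row per stratum."""
    rng = random.Random(seed)

    # Step 1: split the rows into strata.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    # Step 2: sample within each stratum, without replacement.
    sample = []
    for members in strata.values():
        target = round(sample_size * len(members) / len(rows))
        n = min(len(members), max(1, target))  # floor of 1 keeps rare strata
        sample.extend(rng.sample(members, n))

    # Step 3: combine, then shuffle so strata are not grouped together.
    rng.shuffle(sample)
    return sample

# Hypothetical data: 95 "common" rows and 5 "rare" rows.
rows = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]
picked = stratified_sample(rows, key=lambda r: r[0], sample_size=20)
```

With this data the sample holds 19 common rows and 1 rare row. Because each stratum's allocation is rounded separately, the total can differ slightly from `sample_size` for other inputs.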
Why it helps
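- It guarantees that every group is present, provided each stratum is allocated at least one draw, so rare classes are not lost.
- It reduces the variance of estimates when strata differ from one another, since each group's share of the sample is fixed rather than left to chance.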
- For evaluation, it produces stable metrics that do not swing based on which rare rows happened to be drawn.
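The variance claim can be checked with a small simulation. The sketch below uses an invented population (900 values near 10 and 100 values near 100) and compares the spread of the sample mean under simple random sampling and under proportional stratified sampling. All names and numbers are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 900 rows near 10 and 100 rows near 100.
common = [rng.gauss(10, 1) for _ in range(900)]
rare = [rng.gauss(100, 1) for _ in range(100)]
population = common + rare

def simple_estimate(n=20):
    """Mean of a simple random sample of n rows."""
    return statistics.mean(rng.sample(population, n))

def stratified_estimate(n=20):
    """Mean of a proportionally allocated sample: 90% common, 10% rare."""
    n_rare = n // 10
    sample = rng.sample(common, n - n_rare) + rng.sample(rare, n_rare)
    return statistics.mean(sample)

# Repeat each procedure many times and compare how much the estimates vary.
simple_var = statistics.variance([simple_estimate() for _ in range(2000)])
strat_var = statistics.variance([stratified_estimate() for _ in range(2000)])
print(f"simple: {simple_var:.2f}  stratified: {strat_var:.2f}")
```

In this setup the stratified estimate should vary far less, because the number of rare rows is fixed at 2 in every sample instead of fluctuating by chance. The gain shrinks as the strata become more alike.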
Where it shows up
- Building train/test splits so both sides have the same class balance.
- Cross-validation folds that each preserve the label distribution.
- Survey-style estimates where each segment must be represented.
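As one illustration of the cross-validation case, the sketch below deals each label's rows round-robin across the folds. It is a simplified stand-in written for this note, with a hypothetical function name and toy labels. In practice a library routine is the usual choice, for example scikit-learn's `StratifiedKFold`, or the `stratify` argument of `train_test_split` for a single split.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds so that every fold gets
    close to the same share of each label."""
    rng = random.Random(seed)

    # Group row indices by label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal this label's rows round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical labels: 90 negatives and 10 positives, split into 5 folds.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Here every fold ends up with 18 negatives and 2 positives. When a label's count is not divisible by `k`, the earlier folds receive the extra rows, so fold sizes can differ by a few rows.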
Key idea
Stratified sampling draws within meaningful groups so every stratum is represented, preserving rare classes and producing lower-variance, more stable estimates.