The risk of random splits
When you split data randomly, the proportions of each class can drift by chance. If a rare class is ten percent of the data, a random validation slice might land at five percent or fifteen percent, distorting your metrics.
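A small simulation makes the drift concrete. The sketch below is illustrative only: the dataset size (200 rows), the 20 percent validation slice, and the function name are assumptions chosen for the example, not taken from any library.

```python
import random

def rare_share_in_random_split(n_rows=200, rare_fraction=0.10,
                               val_fraction=0.20, seed=0):
    """Return the rare-class share of a purely random validation slice."""
    rng = random.Random(seed)
    n_rare = round(n_rows * rare_fraction)
    labels = [1] * n_rare + [0] * (n_rows - n_rare)  # 1 marks the rare class
    rng.shuffle(labels)                              # random, unstratified split
    n_val = round(n_rows * val_fraction)
    val = labels[:n_val]
    return sum(val) / len(val)

# Repeat the split with different seeds and look at the spread.
shares = [rare_share_in_random_split(seed=s) for s in range(200)]
print(f"rare share in validation: min={min(shares):.3f}, max={max(shares):.3f}")
```

Although the rare class is always ten percent of the full data, the share in the validation slice moves from run to run, and the smaller the slice, the wider that spread tends to be.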
What stratification does
Stratified sampling draws from each group, called a stratum, separately, so every split has the same proportions as the full dataset. If the data is 80 percent one class and 20 percent the other, every fold stays at roughly 80/20.
- It is especially important for imbalanced classification.
- It keeps small classes represented in every fold.
- It reduces variance between folds caused by uneven label mixes.
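The mechanics can be pictured with a minimal plain-Python sketch: shuffle the rows of each class separately, then take the same fraction from each. The helper name `stratified_split` and the toy 80/20 labels are made up for illustration; in practice a library splitter would normally do this job.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split row indices so each label keeps about the same share on both sides."""
    rng = random.Random(seed)

    # Collect the row indices belonging to each stratum (here: each label).
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                           # randomize within the stratum
        n_val = round(len(indices) * val_fraction)     # same fraction per stratum
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

labels = ["common"] * 80 + ["rare"] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.2, seed=42)

val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)  # 16 common and 4 rare rows, the same 80/20 mix as the full data
```

Because the fraction is applied per class, the result no longer depends on the luck of the shuffle: any seed gives 16 common and 4 rare validation rows here. With very small classes, rounding can still leave a stratum with zero validation rows, so it is worth checking the counts.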
Beyond labels
You can stratify on things other than the target.
- Stratify by a region or customer segment to keep each split representative.
- Bin a continuous value, such as a numeric regression target, and stratify on the bins so each split covers the same range of values.
- Combine a few keys when one alone is not enough.
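As a rough sketch of the last two ideas, a continuous column can be turned into rank-based bins, and several columns can be zipped into one composite stratum key. The function names, the `regions` and `prices` columns, and the choice of two bins are all hypothetical.

```python
from collections import Counter

def quantile_bins(values, n_bins=4):
    """Assign each value a bin id in 0..n_bins-1 by rank, giving near-equal bin sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

def combined_key(*columns):
    """Zip several per-row columns into one tuple key per row."""
    return list(zip(*columns))

# Toy data: a categorical column and a continuous one.
regions = ["north", "north", "north", "north", "south", "south", "south", "south"]
prices = [10.0, 250.0, 40.0, 900.0, 15.0, 300.0, 55.0, 700.0]

price_bin = quantile_bins(prices, n_bins=2)   # 0 = lower half, 1 = upper half
strata = combined_key(regions, price_bin)     # e.g. ("north", 0)

print(Counter(strata))  # how many rows fall in each composite stratum
```

The resulting keys can be handed to a stratified splitter in place of the class label. Each extra key multiplies the number of strata, so combining too many can leave strata with only one or two rows, which cannot be spread across folds.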
A caution
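Stratification fixes label proportions, but it does not fix leakage from related rows. If the same user appears many times, their rows can land on both sides of a split, so group-aware splitting, which keeps all rows from one user in the same split, is still needed on top.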
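One simple way to combine the two ideas, shown here only as a sketch, is to stratify at the group level: give each group a single label (its majority label), then split whole groups within each label. The function name, the `user_N` identifiers, and the majority-label rule are assumptions for illustration; library implementations such as scikit-learn's `StratifiedGroupKFold` use their own balancing strategy.

```python
import random
from collections import Counter, defaultdict

def stratified_group_split(labels, groups, val_fraction=0.2, seed=0):
    """Keep each group on one side, stratifying groups by their majority label."""
    rng = random.Random(seed)

    # Gather the rows of each group.
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)

    # Bucket groups by majority label (ties go to the label seen first).
    groups_by_label = defaultdict(list)
    for group, rows in rows_by_group.items():
        majority = Counter(labels[i] for i in rows).most_common(1)[0][0]
        groups_by_label[majority].append(group)

    # Take the same fraction of whole groups from each bucket.
    val_groups = set()
    for bucket in groups_by_label.values():
        rng.shuffle(bucket)
        n_val = round(len(bucket) * val_fraction)
        val_groups.update(bucket[:n_val])

    train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
    val_idx = [i for i, g in enumerate(groups) if g in val_groups]
    return train_idx, val_idx

# Toy data: 20 users with 3 rows each; the last 5 users belong to the rare class.
groups = [f"user_{u}" for u in range(20) for _ in range(3)]
labels = [1 if int(g.split("_")[1]) >= 15 else 0 for g in groups]

train_idx, val_idx = stratified_group_split(labels, groups, val_fraction=0.2, seed=7)
train_users = {groups[i] for i in train_idx}
val_users = {groups[i] for i in val_idx}
print(len(train_users & val_users))  # 0: no user appears on both sides
```

In this toy setup every group has the same size and a single label, so the validation set ends up with three common users and one rare user. With uneven group sizes or mixed labels inside a group, the proportions are only approximate.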
Key idea
Stratified sampling preserves class proportions in every split, which matters most for imbalanced data and small classes.