Why it can mislead
Cross-validation estimates generalization by rotating which fold is held out. It only works if each held-out fold is a fair, independent stand-in for unseen data. Several common mistakes break that assumption.
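- Random folds on grouped data put rows from the same entity in both the training and held-out folds, so the model is scored on entities it has already seen.
- Random folds on time series let the model train on observations that come after the ones it is tested on, so the future leaks into the past.
- Preprocessing fitted on the full dataset before splitting (scaling, imputation, feature selection) leaks held-out statistics into training.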
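The preprocessing point can be made concrete with a small sketch. It uses only the Python standard library, and the helper names `kfold_indices` and `scale_inside_fold` as well as the toy `values` column are assumptions made up for illustration. The idea shown is that the scaling statistics are computed from the training rows of each fold and then applied to the held-out rows, rather than computed once over all rows.

```python
from statistics import mean, pstdev


def kfold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def scale_inside_fold(values, train_idx, test_idx):
    """Fit mean and spread on the training rows only, then apply to both parts."""
    train_vals = [values[i] for i in train_idx]
    mu = mean(train_vals)
    sigma = pstdev(train_vals) or 1.0  # avoid dividing by zero on constant data
    train_scaled = [(values[i] - mu) / sigma for i in train_idx]
    test_scaled = [(values[i] - mu) / sigma for i in test_idx]
    return train_scaled, test_scaled, mu


values = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0]  # hypothetical toy feature column
global_mu = mean(values)  # the leaky choice: computed with held-out rows included

for train_idx, test_idx in kfold_indices(len(values), 3):
    _, test_scaled, fold_mu = scale_inside_fold(values, train_idx, test_idx)
    print(f"held out {test_idx}: fold mean {fold_mu:.2f} vs global mean {global_mu:.2f}")
```

The fold mean differs from the global mean in every fold, which is exactly the information that would have leaked had the scaler been fitted before splitting. Libraries such as scikit-learn achieve the same effect by wrapping the preprocessing and the model in a single pipeline that is refitted per fold.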
Matching the split to the data
The split must mirror how the model will be used: whatever is unseen at prediction time should also be unseen by the training folds.
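- Use group k-fold when rows share an entity such as a user or patient, so that each entity falls entirely on one side of the split.
- Use time-based splits for temporal data, training only on the past and testing on what follows.
- Use stratification to keep class proportions similar across folds, which matters most when classes are imbalanced.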
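The three splitting schemes above can be sketched in plain Python. This is a minimal illustration and not a library implementation: the function names `group_kfold`, `expanding_time_split`, and `stratified_kfold` and the toy inputs are assumptions, there is no shuffling, and the time split assumes rows are already sorted by time. In practice scikit-learn's `GroupKFold`, `TimeSeriesSplit`, and `StratifiedKFold` cover the same ground.

```python
from collections import defaultdict


def group_kfold(groups, k):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(k):
        test_groups = set(unique[fold::k])  # every k-th group goes to this fold
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


def expanding_time_split(n, n_splits, test_size):
    """Yield (train_idx, test_idx) where every test block follows its training rows.

    Assumes the n rows are already sorted from oldest to newest.
    """
    if n - n_splits * test_size <= 0:
        raise ValueError("not enough rows for the requested splits")
    for s in range(n_splits):
        test_start = n - (n_splits - s) * test_size
        yield list(range(test_start)), list(range(test_start, test_start + test_size))


def stratified_kfold(labels, k):
    """Yield (train_idx, test_idx) with each class dealt round-robin across folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            fold_of[i] = j % k
    for fold in range(k):
        test_idx = [i for i in range(len(labels)) if fold_of[i] == fold]
        train_idx = [i for i in range(len(labels)) if fold_of[i] != fold]
        yield train_idx, test_idx


# Hypothetical toy inputs, one value per row.
users = ["u1", "u1", "u2", "u2", "u3", "u3", "u4"]
labels = [0, 0, 0, 0, 1, 1]

for train_idx, test_idx in group_kfold(users, 2):
    print("group split, held out rows:", test_idx)
for train_idx, test_idx in expanding_time_split(10, 3, 2):
    print("time split, train up to row", train_idx[-1], "then test", test_idx)
for train_idx, test_idx in stratified_kfold(labels, 2):
    print("stratified split, held out labels:", [labels[i] for i in test_idx])
```

Which function applies depends on the data: the group split answers "how well does this work on a new user", the time split answers "how well does this work on next month", and the stratified split keeps a rare class from vanishing from a fold.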
Choosing a split
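The right split makes the estimate honest: choose the scheme whose held-out fold looks like the data the model will actually meet, whether that means unseen entities, later time periods, or the same class mix.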
Key idea
Cross-validation only estimates generalization honestly when folds are independent and match real usage, so use group, time-based, or stratified splits and keep preprocessing inside each fold.