Cross-Validation Pitfalls

Avoid the subtle ways cross-validation lies about generalization.

Why it can mislead

Cross-validation estimates generalization by rotating which fold is held out. The estimate is only trustworthy if each held-out fold is a fair, independent stand-in for unseen data. Several common mistakes break that assumption.

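  • Random folds on grouped data put rows from the same group in both the training and test folds, so information leaks within groups.
  • Random folds on time series let future observations leak into the data used to predict the past.
  • Preprocessing fit outside the fold (for example, scaling computed on the full dataset) leaks test-fold statistics into training.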
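
The third pitfall is the easiest to miss, so here is a minimal sketch of keeping preprocessing inside the fold. It uses only the Python standard library; the helper names (fit_scaler, transform, k_fold_indices, scale_within_folds) and the toy data are assumptions made for illustration, not part of the lesson.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from the given rows only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the rows are constant
    return mu, sigma


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def scale_within_folds(data, k):
    """Return (scaler, train_scaled, test_scaled) per fold.

    The scaler is fit on the training rows of each fold and then reused
    on the held-out rows, so test statistics never reach training.
    """
    results = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        scaler = fit_scaler(train)  # training fold only
        results.append((scaler, transform(train, scaler), transform(test, scaler)))
    return results


# Hypothetical toy feature with two large values at the end.
toy = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0]
for scaler, _, test_scaled in scale_within_folds(toy, k=3):
    print(scaler, test_scaled)
```

Fitting the scaler once on all six values would give every fold the same mean, one that already includes the held-out rows. That is exactly the leak described above.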

Matching the split to the data

The split must mirror how the model's predictions will be used in practice.

  • Use group k-fold when rows share an entity such as a user or patient, so each entity falls entirely within one fold.
  • Use time-based splits for temporal data, training only on the past and testing on later data.
  • Apply stratification to keep the class balance similar across folds.
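
The three splitters above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: the function names are hypothetical, groups are dealt to folds without balancing fold sizes, and the time split assumes rows are already sorted by time. Libraries such as scikit-learn provide production versions (GroupKFold, TimeSeriesSplit, StratifiedKFold).

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(k):
        test_groups = set(unique[fold::k])  # every k-th group goes to this fold
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


def expanding_time_split(n, n_splits):
    """Yield (train_idx, test_idx) for n rows already sorted by time.

    Every test block comes strictly after the rows it is trained on.
    """
    block = n // (n_splits + 1)
    for s in range(1, n_splits + 1):
        train_end = block * s
        test_end = n if s == n_splits else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


def stratified_k_fold(labels, k):
    """Yield (train_idx, test_idx) with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = {}
    for idxs in by_class.values():
        for pos, i in enumerate(idxs):
            fold_of[i] = pos % k  # deal each class round-robin across folds
    for fold in range(k):
        test_idx = [i for i in range(len(labels)) if fold_of[i] == fold]
        train_idx = [i for i in range(len(labels)) if fold_of[i] != fold]
        yield train_idx, test_idx


# Hypothetical example: three users with two rows each.
users = ["a", "a", "b", "b", "c", "c"]
for train_idx, test_idx in group_k_fold(users, k=3):
    print("train", train_idx, "test", test_idx)
```

In the group split each user's rows stay together, in the time split the training indices always precede the test indices, and in the stratified split each fold receives a share of every class.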

Choosing a split

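The right split makes the estimate honest. Choose it by asking whether rows share an entity, whether they are ordered in time, and whether the classes are imbalanced.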
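
One way to check whether a chosen split is honest is to audit each fold for the leaks described earlier. The sketch below is an illustrative assumption, not a standard API: shared_groups and future_in_train are hypothetical helper names, and the toy rows are made up.

```python
def shared_groups(train_idx, test_idx, groups):
    """Return the groups that appear in both the training and test rows."""
    return {groups[i] for i in train_idx} & {groups[i] for i in test_idx}


def future_in_train(train_idx, test_idx, timestamps):
    """Return True if any training row is at or after the earliest test row."""
    earliest_test = min(timestamps[i] for i in test_idx)
    return any(timestamps[i] >= earliest_test for i in train_idx)


# Hypothetical data: two users, two rows each, in time order.
groups = ["u1", "u1", "u2", "u2"]
timestamps = [1, 2, 3, 4]

# An interleaved split puts each user on both sides and trains on the future.
print(shared_groups([0, 2], [1, 3], groups))
print(future_in_train([0, 2], [1, 3], timestamps))

# A split that respects users and time passes both checks.
print(shared_groups([0, 1], [2, 3], groups))
print(future_in_train([0, 1], [2, 3], timestamps))
```

A non-empty result from shared_groups, or True from future_in_train, signals that the fold's score is likely to be optimistic.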

Key idea

Cross-validation only estimates generalization honestly when folds are independent and match real usage, so use group, time-based, or stratified splits as the data requires, and keep preprocessing inside each fold.

Check yourself

1. Why use group k-fold instead of random folds?

2. What is the right split for time-series data?

3. Where should preprocessing be fit during cross-validation?