Why values go missing
Real data has holes from broken sensors, skipped survey fields, or joins that did not match. How you fill them depends partly on why they are missing.
- Missing-at-random gaps depend only on values you did observe, so they can often be filled from other columns.
- Missing-not-at-random gaps depend on the unobserved value itself, so the gap carries signal, like income left blank by high earners.
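One rough way to start probing why values are missing is to compare the missingness rate across an observed column. The sketch below uses made-up survey rows and a hypothetical helper, missing_rate_by_group; neither comes from a real dataset or library.

```python
from collections import defaultdict

# Hypothetical survey rows: (age_group, income). None marks a skipped field.
rows = [
    ("18-29", 31000), ("18-29", None), ("18-29", None), ("18-29", 28000),
    ("30-49", 52000), ("30-49", 61000), ("30-49", None), ("30-49", 58000),
]


def missing_rate_by_group(rows):
    """Return the share of missing values in the second field per group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for group, value in rows:
        total[group] += 1
        if value is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(rows)
print(rates)
```

A rate that differs between groups suggests the gaps depend on that observed column. It cannot rule out gaps that also depend on the missing value itself, since that value is by definition not in the data.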
Common strategies
- Drop rows or columns when only a small share of values is missing and the gaps look unrelated to what you are predicting.
- Impute with a simple statistic such as the mean, median, or most frequent value.
- Use model-based imputation to predict the missing value from the other features.
- Add a missingness indicator flag so the model knows a value was absent.
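Three of the strategies above can be sketched in a few lines of plain Python. The column ages and the helper names are illustrative assumptions, and model-based imputation is left out because it needs a fitted predictor.

```python
from statistics import median

# Hypothetical toy column. None marks a missing value.
ages = [34, None, 29, 41, None, 52]


def drop_missing(values):
    """Drop strategy: keep only the observed entries."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Simple-statistic strategy: fill each gap with the observed median."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


def missing_flags(values):
    """Indicator strategy: 1 where the value was absent, 0 otherwise."""
    return [1 if v is None else 0 for v in values]


dropped = drop_missing(ages)
imputed = impute_median(ages)
flags = missing_flags(ages)
```

The flags are computed from the original column, before imputation, so they can sit next to the imputed column as an extra feature.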
A critical rule
Compute imputation statistics from the training set only, then apply them unchanged to validation and test. Computing them on the full dataset leaks information from the validation and test sets into training and inflates scores.
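A minimal sketch of this rule, assuming a hypothetical MeanImputer class written for illustration. It separates learning the statistic (fit) from applying it (transform), the same split that libraries such as scikit-learn use for their imputers.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical minimal imputer: learns a mean in fit, reuses it in transform."""

    def __init__(self):
        self.fill_value = None

    def fit(self, values):
        # Learn the fill value from observed entries only.
        observed = [v for v in values if v is not None]
        self.fill_value = mean(observed)
        return self

    def transform(self, values):
        if self.fill_value is None:
            raise RuntimeError("call fit before transform")
        return [self.fill_value if v is None else v for v in values]


# Made-up numbers: the test set holds an extreme value the training set lacks.
train = [10.0, None, 14.0, 12.0]
test = [None, 100.0]

imputer = MeanImputer().fit(train)        # statistic comes from train only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)     # reuses the training mean

# The leaky alternative, shown only for comparison: fitting on train + test.
leaky_fill = MeanImputer().fit(train + test).fill_value
```

In this toy data the training mean is 12.0, while the leaky fit gives 34.0 because the test value 100.0 has pulled the statistic toward data the model is not supposed to have seen.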
Tradeoffs
- Dropping is simple but throws away data and can bias results.
- Mean imputation shrinks variance and can distort relationships.
- An indicator flag preserves the signal that the gap itself carried.
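The variance shrinkage from mean imputation can be checked by hand. The sketch below uses made-up numbers and assumes two gaps were filled with the observed mean.

```python
from statistics import mean, pvariance

# Hypothetical observed values, plus two gaps filled with their mean.
observed = [2.0, 4.0, 6.0, 8.0]
fill = mean(observed)
after_imputation = observed + [fill, fill]

var_before = pvariance(observed)          # spread of the real values
var_after = pvariance(after_imputation)   # spread once gaps sit at the mean

print(var_before, var_after)
```

Each filled value sits exactly at the mean, so it adds nothing to the sum of squared deviations while increasing the count. The variance drops from 5.0 to about 3.33 here.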
Key idea
Choose a missing data strategy based on why values are absent, fit imputers on training data only, and consider flagging the gap itself as a feature.