Handling Missing Data

Why values go missing

Real data has holes from broken sensors, skipped survey fields, or joins that did not match. How you fill them depends partly on why they are missing.

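Missing-at-random (MAR) gaps depend only on values you did observe, so they can often be filled from other columns.
Missing-not-at-random (MNAR) gaps depend on the unobserved value itself, so the gap carries signal, as when high earners leave income blank.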
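
One rough way to probe why values are missing is to compare the missing rate across groups defined by another column. The sketch below uses invented survey rows and a hypothetical helper, missing_rate_by_group. A clear difference between groups suggests the gaps depend on an observed column. It cannot confirm or rule out gaps that depend on the missing value itself, since that value is never seen.

```python
from collections import defaultdict

# Hypothetical survey rows: (age_group, income). None marks a missing income.
rows = [
    ("18-29", 32000), ("18-29", None), ("18-29", None), ("18-29", 28000),
    ("30-49", 54000), ("30-49", 61000), ("30-49", None), ("30-49", 58000),
]


def missing_rate_by_group(rows):
    """Return the fraction of missing values within each group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for group, value in rows:
        total[group] += 1
        if value is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(rows)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of income values missing")
```

In this toy data the younger group skips the income field twice as often, which hints that age could help fill the gaps.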

Common strategies

Drop rows or columns when only a small share of values is missing and the gaps appear unrelated to the outcome.
Impute with a simple statistic such as the mean, median, or most frequent value.
Model-based imputation predicts the missing value from the other features.
Add a missingness indicator flag so the model knows a value was absent.
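
Three of these strategies can be sketched in a few lines of plain Python. The rows, column names, and the helpers drop_incomplete and impute_median_with_flag are all invented for illustration. Model-based imputation is left out because it needs a fitted predictor. For brevity the fill value here comes from every row shown. In a real pipeline it would come from the training rows only.

```python
from statistics import median

# Hypothetical rows. None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 41, "income": None},
    {"age": 33, "income": 61000},
]


def drop_incomplete(rows):
    """Strategy 1: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median_with_flag(rows, column):
    """Strategies 2 and 4: fill gaps with the median and record where they were."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    result = []
    for r in rows:
        new_row = dict(r)  # copy so the input rows are left untouched
        new_row[f"{column}_missing"] = r[column] is None
        if new_row[column] is None:
            new_row[column] = fill
        result.append(new_row)
    return result


complete = drop_incomplete(rows)
filled = impute_median_with_flag(rows, "income")
print(len(complete), "complete rows")
print(filled[2])
```

Dropping keeps two of the four rows, while imputing keeps all four and adds an income_missing column the model can learn from.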

A critical rule

Compute imputation statistics from the training set only, then apply them to validation and test. Computing them on the full dataset leaks information from the held-out data into training and inflates validation and test scores.
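
A minimal sketch of this rule, assuming a hypothetical MeanImputer class with separate fit and transform steps and made-up numbers. The fill value is learned from the training values alone and reused unchanged on the test values. The last lines compute the leaky alternative only to show how far an extreme test value would pull it.

```python
from statistics import mean


class MeanImputer:
    """Learns a fill value in fit() and reuses it in transform()."""

    def __init__(self):
        self.fill_value = None

    def fit(self, values):
        # Only the observed values passed here influence the statistic.
        observed = [v for v in values if v is not None]
        self.fill_value = mean(observed)
        return self

    def transform(self, values):
        if self.fill_value is None:
            raise RuntimeError("call fit() before transform()")
        return [self.fill_value if v is None else v for v in values]


train = [10.0, None, 14.0, 12.0]
test = [None, 100.0]

imputer = MeanImputer().fit(train)          # statistic from training data only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)       # same statistic, no refitting

# For contrast: the fill value if train and test were pooled (the leak).
pooled = [v for v in train + test if v is not None]
leaky_fill = mean(pooled)

print("train-only fill:", imputer.fill_value)
print("pooled fill:", leaky_fill)
```

The train-only fill is 12.0, while pooling would give 34.0 because the test value 100.0 has been allowed to shape what the training data looks like.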

Tradeoffs

Dropping is simple but throws away data and can bias results.
Mean imputation shrinks variance and can distort relationships.
An indicator flag preserves the signal that the gap itself carried.
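
The variance shrinkage is easy to see with a small arithmetic example. The numbers below are invented: four observed values and four gaps, each filled with the observed mean. Every filled value sits exactly at the mean, so it adds nothing to the spread while still increasing the count.

```python
from statistics import mean, pvariance

# Hypothetical column: four observed values and four missing ones.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4

fill = mean(observed)                      # 5.0
imputed = observed + [fill] * n_missing    # gaps filled with the mean

var_before = pvariance(observed)           # spread of the observed values
var_after = pvariance(imputed)             # spread after mean imputation

print("variance of observed values:", var_before)
print("variance after mean imputation:", var_after)
```

With half the column imputed, the variance falls from 5.0 to 2.5, even though nothing new was learned about the missing values.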

Key idea

Choose a missing data strategy based on why values are absent, fit imputers on training data only, and consider flagging the gap itself as a feature.

Check yourself