Handling Missing Values
Real datasets have gaps. Before filling them, it helps to understand why a value is missing, because the mechanism guides the right response.
Missingness mechanisms
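- Missing completely at random (MCAR) means the chance of a gap is unrelated to any value, observed or unobserved, so dropping rows does not bias estimates, although it still reduces sample size.
- Missing at random (MAR) means missingness depends only on other observed columns, which models can account for by conditioning on those columns.
- Missing not at random (MNAR) means the gap depends on the unseen value itself, which is the hardest case because the observed data alone cannot correct for it.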
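One rough way to probe the mechanism is to compare missingness rates across the levels of an observed column. The sketch below uses a made-up table in which `None` marks a missing value; the column names `age_group` and `income` and the helper `missing_rate_by_group` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: None marks a missing income value.
rows = [
    {"age_group": "young", "income": 30},
    {"age_group": "young", "income": None},
    {"age_group": "young", "income": None},
    {"age_group": "young", "income": 34},
    {"age_group": "old", "income": 50},
    {"age_group": "old", "income": 55},
    {"age_group": "old", "income": 60},
    {"age_group": "old", "income": None},
]

def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: fraction of rows whose value_key is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / total[group] for group in total}

rates = missing_rate_by_group(rows, "age_group", "income")
print(rates)  # {'young': 0.5, 'old': 0.25}
```

A clear gap between groups is evidence against MCAR and is consistent with MAR. It cannot rule out MNAR, because that would require the very values that are missing.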
Basic options
- Drop rows when missingness is rare, or drop a column when it is mostly empty.
- Impute a substitute value such as the mean, median, or a learned estimate.
- Flag missingness with an extra indicator column so the model can use the fact that a value was absent.
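The dropping option can be sketched in plain Python as follows. The table, the helper names `drop_sparse_columns` and `drop_incomplete_rows`, and the 0.5 threshold are all assumptions made for illustration, not recommended defaults.

```python
# Hypothetical toy table as a list of dicts; None marks a missing value.
rows = [
    {"id": 1, "height": 170, "notes": None},
    {"id": 2, "height": None, "notes": None},
    {"id": 3, "height": 165, "notes": None},
    {"id": 4, "height": 180, "notes": "ok"},
]

def drop_sparse_columns(rows, max_missing_fraction):
    """Remove columns whose share of None values exceeds the threshold."""
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing_fraction
    ]
    return [{c: r[c] for c in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Remove rows that still contain any None value."""
    return [r for r in rows if all(v is not None for v in r.values())]

# The 0.5 threshold is an arbitrary illustration.
trimmed = drop_sparse_columns(rows, max_missing_fraction=0.5)
complete = drop_incomplete_rows(trimmed)
print(complete)
```

Dropping the mostly empty `notes` column first means only one row is lost afterwards, whereas dropping incomplete rows first would have discarded three of the four rows.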
Dropping is simple but throws away data and can bias results if missingness is informative. A missing indicator plus imputation often captures both the substitute value and the signal that the original was absent.
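A minimal sketch of imputation combined with a missing indicator, assuming a single numeric column stored as a list with `None` for gaps. The median is used here as one reasonable substitute; the helper name `impute_with_indicator` is hypothetical.

```python
from statistics import median

# Hypothetical numeric column; None marks a missing value.
values = [4.0, None, 10.0, 6.0, None]

def impute_with_indicator(values):
    """Fill None with the median of observed values and flag what was filled."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    # 1 where the original value was absent, 0 otherwise.
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

imputed, was_missing = impute_with_indicator(values)
print(imputed)      # [4.0, 6.0, 10.0, 6.0, 6.0]
print(was_missing)  # [0, 1, 0, 0, 1]
```

The model receives both columns, so it sees a usable number in every row and can still learn from the fact that some values were originally absent.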
Key idea
Diagnose why data is missing, then choose between dropping, imputing, and flagging, often combining imputation with an indicator column.