Handling Missing Values

Understand why data goes missing and the basic options for dealing with gaps.

Real datasets have gaps. Before filling them, it helps to understand why a value is missing, because the mechanism guides the right response.

Missingness mechanisms

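  • Missing completely at random (MCAR) means the gap is unrelated to any value, observed or unobserved, so dropping rows shrinks the sample but does not bias estimates.
  • Missing at random (MAR) means missingness depends only on other observed columns, so a model that conditions on those columns can account for it.
  • Missing not at random (MNAR) means the gap depends on the unseen value itself, which is the hardest case because the observed data alone cannot reveal the pattern.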
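
To make the difference concrete, here is a small simulation sketch. It contrasts values that go missing completely at random with values that go missing not at random. The income data, the 55,000 threshold, and the missingness rates are all invented for illustration; only the Python standard library is used.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 10,000 incomes drawn from a normal distribution.
incomes = [rng.gauss(50_000, 10_000) for _ in range(10_000)]
true_mean = statistics.mean(incomes)

# Missing completely at random: every value has the same 30% chance
# of being missing, regardless of what the value is.
mcar_observed = [x for x in incomes if rng.random() > 0.3]

# Missing not at random: values above 55,000 go missing 80% of the
# time, so the gap depends on the unseen value itself.
mnar_observed = [
    x for x in incomes
    if not (x > 55_000 and rng.random() < 0.8)
]

# Compare the mean of what is left against the true mean.
mcar_bias = statistics.mean(mcar_observed) - true_mean
mnar_bias = statistics.mean(mnar_observed) - true_mean

print(f"bias when missing completely at random: {mcar_bias:,.0f}")
print(f"bias when missing not at random:        {mnar_bias:,.0f}")
```

Under these assumptions, the first bias should be small relative to the spread of the data. The second should be clearly negative, because high incomes are the ones that disappear.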

Basic options

  • Drop rows when only a few have gaps, or drop a column when it is mostly empty.
  • Impute a substitute value such as the mean, median, or a learned estimate.
  • Flag missingness with an extra indicator column so the model can use the fact that a value was absent.

Dropping is simple but throws away data and can bias results if missingness is informative. A missing indicator plus imputation often captures both the substitute value and the signal that the original was absent.
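
The three options can be sketched with pandas as follows. This is a minimal illustration and not a recommended pipeline. The table, its column names, and the choice of the median as the substitute value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": ["Oslo", "Lima", None, "Oslo", "Lima"],
})

# Drop: remove every row that has at least one gap.
dropped = df.dropna()

# Flag: record where "age" was missing *before* filling it in,
# otherwise the information is lost.
flagged = df.copy()
flagged["age_missing"] = flagged["age"].isna().astype(int)

# Impute: fill the gaps with the median of the observed ages.
age_median = flagged["age"].median()
flagged["age"] = flagged["age"].fillna(age_median)

print(dropped)
print(flagged)
```

Here dropping keeps only two of the five rows. The flag-and-impute version keeps all five and still tells the model which ages were filled in. In a real workflow the median would normally be computed on the training split only and then reused on validation and test data.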

Key idea

Diagnose why data is missing, then choose between dropping, imputing, and flagging, often combining imputation with an indicator column.

Check yourself

1. What does missing not at random mean?

2. Why add a missing indicator column?