Machine Learning

Handling Missing Data

Strategies for the gaps that real datasets always contain.

Why values go missing

Real data has holes from broken sensors, skipped survey fields, or joins that did not match. How you fill them depends partly on why they are missing.

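  • Gaps that are missing at random depend only on values you did observe, so they can often be filled from other columns.
  • Gaps that are missing not at random depend on the unobserved value itself, so the gap carries signal, like income left blank by high earners.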

Common strategies

  • Drop rows or columns when missingness is small and harmless.
  • Impute with a simple statistic such as the mean, median, or most frequent value.
  • Use model-based imputation to predict the missing value from the other features.
  • Add a missingness indicator flag so the model knows a value was absent.
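The simpler strategies above can be sketched in a few lines of pandas. This is a minimal illustration, not a recommended pipeline: the DataFrame, its column names (`age`, `city`), and the new `age_missing` column are all hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Oslo", "Lima", None, "Oslo"],
})

# Record which values were absent before filling anything,
# so the information that a gap existed is not lost.
df["age_missing"] = df["age"].isna().astype(int)

# Numeric column: fill with the median of the observed values.
age_median = df["age"].median()
df["age"] = df["age"].fillna(age_median)

# Categorical column: fill with the most frequent observed value.
city_mode = df["city"].mode().iloc[0]
df["city"] = df["city"].fillna(city_mode)
```

The indicator is created first on purpose: once the gap is filled, there is no way to tell an imputed value from a real one.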

A critical rule

Compute imputation statistics from the training set only, then apply them to validation and test. Computing them on the full dataset leaks information from the validation and test sets into training and inflates scores.
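One way to follow this rule is scikit-learn's `SimpleImputer`, which separates learning the statistic (`fit`) from applying it (`transform`). The sketch below assumes a single numeric feature and made-up values; the array names `X_train` and `X_test` are placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature arrays; values are made up for illustration.
X_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # the mean is learned from training rows only

X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)  # reuses the training mean

# The learned statistic: mean of the observed training values 1, 3, 5.
train_mean = imputer.statistics_[0]
```

The test gap is filled with the training mean of 3.0, even though the test set contains a much larger value. Fitting on train and test together would have pulled that mean toward 100.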

Tradeoffs

  • Dropping is simple but throws away data and can bias results.
  • Mean imputation shrinks variance and can distort relationships.
  • An indicator flag preserves the signal that the gap itself carried.
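The variance shrinkage from mean imputation is easy to see with small numbers. The sketch below uses only Python's standard library and assumes a hypothetical column with four observed values and four gaps.

```python
import statistics

observed = [2.0, 4.0, 6.0, 8.0]  # values actually recorded
n_missing = 4                    # hypothetical number of gaps

# Fill every gap with the mean of the observed values.
mean_value = statistics.mean(observed)
filled = observed + [mean_value] * n_missing

# Population variance before and after imputation.
var_before = statistics.pvariance(observed)
var_after = statistics.pvariance(filled)
```

Every imputed value sits exactly at the mean and adds nothing to the spread, so here the variance halves from 5.0 to 2.5. The more gaps are filled this way, the stronger the effect.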

Key idea

Choose a missing-data strategy based on why values are absent, fit imputers on training data only, and consider flagging the gap itself as a feature.

Check yourself

1. Where should imputation statistics be computed?

2. Why add a missingness indicator?