Imputation Strategies
Imputation replaces missing entries with estimated values so the dataset stays complete. Strategies range from trivial statistics to small predictive models.
Simple imputation
- Mean or median fills numeric columns; median resists outliers.
- Mode fills categorical columns with the most frequent category.
- Constant fills with a sentinel like zero or an explicit unknown label.
These are fast and stable but ignore relationships between columns: filling many gaps with a single value shrinks the column's variance and weakens its correlations with other columns.
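The simple fills above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production imputer: the helper name `fill_missing`, the use of `None` as the missing marker, and the toy `ages` and `colors` columns are all assumptions made for the example.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="median", constant=None):
    """Return a copy of `values` with every None replaced by one fill value.

    Hypothetical helper for illustration; None marks a missing entry.
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [20, 22, None, 24, 90]          # 90 is an outlier
colors = ["red", None, "blue", "red"]

mean_filled = fill_missing(ages, "mean")        # gap filled with 39
median_filled = fill_missing(ages, "median")    # gap filled with 23.0
mode_filled = fill_missing(colors, "mode")      # gap filled with "red"
constant_filled = fill_missing(colors, "constant", constant="unknown")
```

The outlier pulls the mean fill up to 39 while the median fill stays at 23.0, which is the robustness the median bullet refers to.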
Model-based imputation
- KNN imputation fills a gap using the values of the nearest similar rows.
- Iterative imputation models each column with missing values as a function of the others, cycling until estimates stabilize.
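As a rough sketch of the KNN idea, the function below fills each gap with the average of that column over the k nearest rows. Several details are simplifying assumptions for illustration: the name `knn_impute`, `None` as the missing marker, and a distance computed only over columns that both rows have observed. Library implementations differ in these details.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper: distance is a root-mean-square difference over the
    columns both rows have observed, a simplification chosen for this sketch.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common columns, so treat as infinitely far
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have column j observed.
            donors = [r for idx, r in enumerate(rows) if idx != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:  # leave the gap as None if no donor exists
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled


data = [
    [1.0, 10.0],
    [1.2, 12.0],
    [5.0, 50.0],
    [1.1, None],   # closest to the first two rows on column 0
]
imputed = knn_impute(data, k=2)
```

Here the last row borrows from its two closest neighbours, so the gap becomes 11.0 rather than the column mean of 24.0 that a simple mean fill would produce.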
Model-based methods respect correlations between columns and often improve downstream accuracy, at the cost of extra compute and a risk of overfitting the imputer. Whatever you choose, fit the imputer on the training fold only and apply the learned statistics or models to validation and test data to avoid leakage.
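The fit-on-train rule can be made concrete with a small fit/transform split. The class `MedianImputer` and the toy `train` and `test` lists are hypothetical, written only to show where the statistic is learned and where it is reused.

```python
from statistics import median


class MedianImputer:
    """Hypothetical minimal imputer: learns a median in fit, reuses it in transform."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_ = median(observed)
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]


train = [3.0, None, 5.0, 4.0]
test = [None, 100.0, 120.0]

imputer = MedianImputer().fit(train)     # statistic comes from train only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)    # test gap gets the training median, 4.0

# The leaky alternative computes the median over train and test together.
leaky_fill = median([v for v in train + test if v is not None])   # 5.0
```

The training median is 4.0, while pooling train and test gives 5.0: the pooled statistic has absorbed information from the held-out data, which is exactly the leakage the rule is meant to prevent.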
Key idea
Imputation spans simple statistic fills and correlation-aware model-based methods, but the imputer must always be fit on training data only.