Outlier Detection and Treatment
An outlier is a value far from the bulk of the data. Some are errors, others are rare but genuine events, so detection and treatment require judgment.
Detecting outliers
- Z score flags points more than a few standard deviations from the mean, suited to roughly normal data.
- Interquartile range flags points below the first quartile or above the third quartile by more than one and a half times the IQR, which is robust to skew.
- Model based methods like isolation forests score how isolated a point is in feature space.
Treating outliers
- Remove them when they are clear data entry errors.
- Cap or winsorize by clipping values to a percentile threshold.
- Transform with a log to compress a long tail.
- Keep them when they represent the rare cases you actually want to model.
Blindly deleting outliers can erase the very signal that matters, such as fraud or equipment failure. Always ask whether an extreme value is noise or information before acting.
Key idea
Detect outliers with z score, IQR, or model based methods, then decide to remove, cap, transform, or keep based on whether they are noise or genuine signal.