What is an outlier
An outlier is a point that lies far from the bulk of the data. It might be a typo, a sensor glitch, or a genuine rare event. Deciding which is the hard part.
Simple statistical methods
- The z score flags points more than a few standard deviations from the mean.
- The interquartile range rule flags points below the first quartile or above the third quartile by more than one and a half times the spread.
- These assume a roughly symmetric distribution and weaken with skew.
Model based methods
- Isolation forest isolates points with random splits, and outliers need fewer splits to separate.
- Local outlier factor compares a point density to its neighbors density.
- These handle complex shapes that simple thresholds miss.
What to do with them
- Investigate before deleting, since real anomalies may be the most valuable rows.
- Cap extreme values to a sensible bound, a step called winsorizing.
- Keep them and use models that tolerate them, like tree based methods.
Key idea
Outlier detection mixes simple statistical rules with model based methods, but the decision to drop, cap, or keep a point depends on whether it is an error or a real signal.