← Lessons

quiz vs the machine

Gold1400

Machine Learning

Outlier Detection

Spot the points that sit far from the rest of the data.

5 min read · core · beat Gold to climb

What is an outlier

An outlier is a point that lies far from the bulk of the data. It might be a typo, a sensor glitch, or a genuine rare event. Deciding which is the hard part.

Simple statistical methods

  • The z score flags points more than a few standard deviations from the mean.
  • The interquartile range rule flags points below the first quartile or above the third quartile by more than one and a half times the spread.
  • These assume a roughly symmetric distribution and weaken with skew.

Model based methods

  • Isolation forest isolates points with random splits, and outliers need fewer splits to separate.
  • Local outlier factor compares a point density to its neighbors density.
  • These handle complex shapes that simple thresholds miss.

What to do with them

  • Investigate before deleting, since real anomalies may be the most valuable rows.
  • Cap extreme values to a sensible bound, a step called winsorizing.
  • Keep them and use models that tolerate them, like tree based methods.

Key idea

Outlier detection mixes simple statistical rules with model based methods, but the decision to drop, cap, or keep a point depends on whether it is an error or a real signal.

Check yourself

Answer to earn rating on the learn ladder.

1. How does isolation forest identify outliers?

2. What is a safe first response to a detected outlier?