Anomaly Detection With Isolation Forest

The isolation forest detects anomalies with a clever idea: outliers are few and different, so they are easy to isolate. The method needs no labels and scales well to large datasets.

Isolating points with random trees

The algorithm builds many random trees. To build one tree it repeatedly:

picks a random feature, and
picks a random split value between that feature's minimum and maximum.

Repeating these two steps recursively partitions the data until each point lands alone in a leaf. The path length of a point is the number of splits between the root and its leaf, that is, how many splits it took to isolate it.
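The tree-building steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names build_tree and path_length are hypothetical, the tree is grown until every point is isolated, and only features that still vary among the remaining points are considered for a split (an assumption made here so that every split can separate something).

```python
import random

def build_tree(points, rng):
    """Split points recursively until each one sits alone in a leaf."""
    if len(points) <= 1:
        return {"size": len(points)}          # leaf: the point is isolated
    # Features on which the remaining points still differ (assumption of this sketch).
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:
        return {"size": len(points)}          # exact duplicates cannot be split
    d = rng.choice(dims)                      # pick a random feature
    lo = min(p[d] for p in points)
    hi = max(p[d] for p in points)
    split = rng.uniform(lo, hi)               # random value between min and max
    left = [p for p in points if p[d] < split]
    right = [p for p in points if p[d] >= split]
    if not left or not right:                 # degenerate draw: try again
        return build_tree(points, rng)
    return {"feature": d, "split": split,
            "left": build_tree(left, rng), "right": build_tree(right, rng)}

def path_length(tree, point):
    """Count the splits between the root and the leaf that holds the point."""
    depth = 0
    while "feature" in tree:
        side = "left" if point[tree["feature"]] < tree["split"] else "right"
        tree = tree[side]
        depth += 1
    return depth

# Illustrative data: a tight cluster plus one far-away point.
rng = random.Random(0)
cluster = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]
outlier = (10.0, 10.0)
tree = build_tree(cluster + [outlier], rng)
print(path_length(tree, outlier), path_length(tree, cluster[0]))
```

On data like this the far-away point usually reaches its leaf after one or two splits, while points inside the cluster need several more.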

Anomaly score from path length

Normal points sit in dense regions and need many splits to isolate, giving long paths. Anomalies sit apart and get separated quickly, giving short paths. Averaging a point's path length over all trees in the forest gives its anomaly score: the shorter the average path, the more likely the point is an anomaly.
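One common way to turn the average path length into a score between 0 and 1 is the normalization from the original isolation forest paper, s = 2^(-E[h] / c(n)), where E[h] is the point's average path length and c(n) is the expected path length for a sample of n points. The sketch below assumes that formula; the function names c_factor and anomaly_score are hypothetical, and n is assumed to be at least 2.

```python
import math

EULER_GAMMA = 0.5772156649  # used to approximate harmonic numbers

def c_factor(n):
    """Expected path length for n points (average unsuccessful BST search)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Map one point's per-tree path lengths to a score in (0, 1].

    path_lengths: the point's path length in each tree of the forest.
    n: number of points each tree was built from (assumed >= 2).
    """
    mean_depth = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_depth / c_factor(n))

# Short paths push the score toward 1, long paths push it below 0.5.
print(anomaly_score([1, 2, 1], n=256))
print(anomaly_score([12, 14, 13], n=256))
```

Under this normalization a point whose average path equals c(n) scores exactly 0.5, so scores well above 0.5 suggest an anomaly and scores below it suggest a normal point.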

Why it works well

It targets anomalies directly rather than modeling normal data in full.
Its cost grows roughly linearly with the number of points, because each tree is typically built from a small random subsample.
It handles many features without computing distances.

A contamination parameter, the expected fraction of anomalies in the data and usually a small value, sets the score threshold and therefore how many points are flagged.
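The effect of the contamination parameter can be illustrated with a small thresholding sketch. It assumes that higher scores mean more anomalous and that contamination is the expected fraction of anomalies; the helper flag_anomalies and the example scores are hypothetical, and real libraries may compute the cutoff differently (for example from a percentile of the scores).

```python
def flag_anomalies(scores, contamination=0.1):
    """Flag roughly the top `contamination` fraction of scores.

    Returns a list of booleans (True = flagged) and the threshold used.
    Ties at the threshold are all flagged, so the count can exceed the target.
    """
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    k = max(1, round(contamination * len(scores)))   # how many points to flag
    threshold = sorted(scores, reverse=True)[k - 1]  # k-th highest score
    return [s >= threshold for s in scores], threshold

# Made-up scores for ten points; two of them stand out.
scores = [0.45, 0.91, 0.48, 0.52, 0.88, 0.41, 0.47, 0.50, 0.44, 0.46]
flags, threshold = flag_anomalies(scores, contamination=0.2)
print(threshold, [i for i, f in enumerate(flags) if f])
```

Raising the contamination value lowers the threshold and flags more points, so it is best set from prior knowledge of how rare anomalies are in the data.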

Key idea

An isolation forest flags anomalies by random partitioning, since outliers are isolated in few splits and have short average path lengths.
