Anomaly Detection With Isolation Forest
The isolation forest detects anomalies with a clever idea: outliers are few and different, so they are easy to isolate. The method needs no labels and scales well to large datasets.
Isolating points with random trees
The algorithm builds many random trees. To build one tree it repeatedly:
- picks a random feature, and
- picks a random split value between that feature's minimum and maximum.
Repeating these splits partitions the data until each point lands alone in a leaf. A point's path length is the number of splits between the root and the leaf that isolates it.
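The tree-building steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names build_tree and path_length, the dictionary-based node layout, and the max_depth cap are all assumptions made for the example.

```python
import random

def build_tree(points, rng, depth=0, max_depth=50):
    """Recursively split points at random until each one is alone in a leaf."""
    # Stop when a single point remains or the (assumed) depth cap is reached.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))       # pick a random feature
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:
        # Simplification: give up if the chosen feature cannot separate the points.
        return {"size": len(points)}
    split = rng.uniform(lo, hi)                   # random value between min and max
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    if not left or not right:
        # Degenerate split (value landed exactly on the minimum); stop here.
        return {"size": len(points)}
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }

def path_length(tree, point):
    """Count the splits between the root and the leaf that holds the point."""
    depth = 0
    while "feature" in tree:                      # internal nodes carry a split
        branch = "left" if point[tree["feature"]] < tree["split"] else "right"
        tree = tree[branch]
        depth += 1
    return depth

# Toy data: a dense cluster around the origin plus one distant point.
rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

tree = build_tree(data, rng)
print("outlier path:", path_length(tree, outlier))
print("cluster point path:", path_length(tree, cluster[0]))
```

In any single tree the distant point is not guaranteed to have the shorter path, because the splits are random. The contrast becomes reliable only when path lengths are averaged over many trees.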
Anomaly score from path length
Normal points sit in dense regions and need many splits to isolate, giving long paths. Anomalies sit apart and get separated quickly, giving short paths. Averaging a point's path length across the whole forest produces its anomaly score: the shorter the average path, the more likely the point is an anomaly.
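One common way to turn the average path length into a score between 0 and 1 is the normalization from the original isolation forest formulation, s = 2^(-E[h] / c(n)), where E[h] is the average path length and c(n) is the expected path length for a sample of n points. The sketch below assumes that formulation. The function names are illustrative, and the harmonic number is approximated with a logarithm plus the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_normalizer(n):
    """c(n): expected path length of an unsuccessful binary-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Map one point's path lengths across the forest to a score in (0, 1]."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / average_path_normalizer(n))

# Hypothetical path lengths for two points, each measured in three trees
# built on samples of 256 points.
short_paths = [2, 3, 2]      # isolated quickly
long_paths = [12, 13, 11]    # buried in a dense region

print(anomaly_score(short_paths, 256))
print(anomaly_score(long_paths, 256))
```

Under this normalization a point whose average path equals c(n) scores exactly 0.5. Scores well above 0.5 suggest an anomaly, and scores below 0.5 suggest a normal point.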
Why it works well
- It targets anomalies directly rather than modeling normal data in full.
- Its cost grows roughly linearly with the number of points.
- It handles many features without computing distances.
A contamination parameter, the expected fraction of anomalies in the data, sets the score threshold and therefore how many points are flagged. It is usually set to a small value.
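To make the role of the contamination parameter concrete, the sketch below flags the highest-scoring share of points. The helper flag_anomalies and the example scores are hypothetical, and rounding the count to at least one point is an assumption of this example rather than the behavior of any particular library.

```python
def flag_anomalies(scores, contamination=0.05):
    """Flag roughly the top `contamination` fraction of points by anomaly score."""
    # Number of points to flag; at least one (an assumption of this sketch).
    n_flag = max(1, round(contamination * len(scores)))
    # The threshold is the score of the n_flag-th highest point.
    threshold = sorted(scores, reverse=True)[n_flag - 1]
    # Ties at the threshold are all flagged, so the count can exceed n_flag.
    flags = [s >= threshold for s in scores]
    return flags, threshold

# Made-up anomaly scores for ten points (higher means more anomalous).
scores = [0.45, 0.48, 0.91, 0.50, 0.47, 0.52, 0.44, 0.49, 0.46, 0.51]

flags, threshold = flag_anomalies(scores, contamination=0.1)
print(threshold, [i for i, f in enumerate(flags) if f])
```

Raising the contamination value lowers the threshold and flags more points, so it is better treated as a prior belief about the data than as a tuning knob to push until something is flagged.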
Key idea
An isolation forest flags anomalies by random partitioning, since outliers are isolated in few splits and have short average path lengths.