Machine Learning

Random Forests

Averaging many decorrelated trees to cut variance.

5 min read · advanced

The idea

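A single deep decision tree has low bias but high variance: it fits the training data closely, yet a small change in that data can produce a very different tree. A random forest trains many such trees and averages them, which keeps the low bias while sharply reducing variance.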
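A standard result makes the variance claim concrete. As a sketch, assume the forest averages B trees whose predictions each have variance σ² and pairwise correlation ρ (these symbols are illustrative and do not appear in the lesson):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding trees shrinks only the second term. The first term, ρσ², stays no matter how large B gets, so the remaining lever is lowering the correlation between trees.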

Two sources of randomness

For the trees to help, their errors must be different from each other. Random forests inject randomness twice:

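  • Bagging trains each tree on a bootstrap sample: a random draw of rows, with replacement, usually the same size as the training set.
  • At each split, only a random subset of features is considered, which decorrelates the trees by stopping a few strong features from dominating every tree.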
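The two draws above can be sketched with the standard library alone. This is a minimal illustration, not a full tree learner: the function names are hypothetical, and the square-root default for the subset size is an assumed (though common) heuristic rather than something the lesson specifies.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, rng, k=None):
    """Pick the features a single split may consider, without replacement."""
    if k is None:
        # Assumed default: the square-root heuristic often used for classification.
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)              # fixed seed so the sketch is repeatable
rows = bootstrap_indices(10, rng)   # some rows repeat, others are left out
feats = candidate_features(16, rng) # 4 of the 16 feature indices
```

In a real forest, `bootstrap_indices` would run once per tree, while `candidate_features` would run again at every split inside that tree.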

Combining predictions

For classification the forest takes a majority vote across trees. For regression it averages their outputs. Because the trees make different mistakes, the errors partly cancel.
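As a small sketch of the combining step, assume each tree has already produced its prediction for one input. The function names and sample values here are made up for illustration.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the class label predicted by the most trees."""
    # On a tie, Counter keeps the label that appeared first in tree_votes.
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_outputs)


label = forest_classify(["spam", "ham", "spam", "spam", "ham"])
value = forest_regress([2.0, 3.0, 4.0])
```

Many implementations average per-class probabilities instead of counting hard votes; the hard vote shown here matches the description above.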

Handy extras

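The rows left out of each bootstrap sample form that tree's out-of-bag (OOB) set. Scoring each row using only the trees that never saw it gives a validation estimate without a separate holdout set. Forests also yield feature importance rankings, for example from how much each feature's splits reduce impurity.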
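To get a feel for how large the out-of-bag set is, the sketch below simulates bootstrap draws and measures the share of rows left out. The helper name, seed, and sizes are assumptions chosen for the demonstration.

```python
import random


def oob_rows(n_rows, rng):
    """Return the row indices that one bootstrap sample never drew."""
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return [i for i in range(n_rows) if i not in in_bag]


rng = random.Random(0)
n = 1000
# Repeat the draw for 200 hypothetical trees and average the left-out share.
fractions = [len(oob_rows(n, rng)) / n for _ in range(200)]
avg_oob_fraction = sum(fractions) / len(fractions)
```

The probability that a given row is missed by a bootstrap sample of size n is (1 - 1/n)^n, which approaches 1/e (about 37%) as n grows, so each tree leaves roughly a third of the rows available for validation.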

Key idea

Random forests average many decorrelated trees built with bagging and random feature subsets, slashing variance with little or no increase in bias.

Check yourself

1. Why do random forests pick a random feature subset at each split?

2. What is the out-of-bag set?