Gradient Boosted Trees
Random forests build trees in parallel and average them. Gradient boosting builds trees in sequence, each one correcting the mistakes of the ensemble so far.
Fitting the residuals
Boosting starts with a simple prediction such as the mean. Then it repeats a cycle.
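- Compute the current errors (the residuals): each target minus the current prediction.
- Fit a new small tree to those residuals. For squared-error loss they are exactly the negative gradient of the loss with respect to the predictions; for other losses the tree is fit to the negative gradient itself.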
- Add a scaled version of that tree to the running model. The scale factor is the learning rate.
Each tree nudges the prediction in the direction that most reduces the loss, which is why the method is called gradient boosting.
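The cycle above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes squared-error loss, a single numeric feature, and depth-one trees (stumps), and the names fit_stump, stump_predict, boost, predict and mse are made up for this example.

```python
import math

def fit_stump(xs, targets):
    """Find the single threshold split that minimizes squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=100, learning_rate=0.1):
    base = sum(ys) / len(ys)           # start from the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Residuals = negative gradient of 1/2 * (y - prediction)^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Add a scaled version of the new tree to the running model.
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
        stumps.append(stump)
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data (an assumption for illustration): a sine curve on [0, 4).
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

baseline = (sum(ys) / len(ys), 0.1, [])   # mean only, no trees
model = boost(xs, ys)
print("mean-only MSE:", mse(baseline, xs, ys))
print("boosted MSE:  ", mse(model, xs, ys))
```

Each stump on its own can only split the data once, yet the training error of the sum keeps shrinking as more of them are added.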
Weak learners
The added trees are deliberately shallow, often just a few levels deep. A single shallow tree is a weak learner, but hundreds of them combined form a strong model. Keeping each tree weak limits how much any single step can overfit.
Bias and variance
Boosting mainly reduces bias by adding capacity step by step, in contrast to bagging, which mainly reduces variance. Because it keeps fitting the remaining errors, boosting can overfit if run too long, so the number of trees is tuned carefully, typically against held-out data.
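One common way to tune the number of trees is to track the error on a held-out validation set after every round and keep the round where it is lowest. The sketch below is a hedged illustration of that idea under simplifying assumptions (squared-error loss, one feature, depth-one trees, synthetic noisy data); fit_stump, mse and boost_with_validation are hypothetical names, not part of any library.

```python
import math
import random

def fit_stump(xs, targets):
    """Best single-threshold split on a 1-D feature by squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost_with_validation(train, valid, n_trees=200, learning_rate=0.1):
    xs, ys = train
    vx, vy = valid
    base = sum(ys) / len(ys)
    train_pred = [base] * len(xs)
    valid_pred = [base] * len(vx)
    train_curve, valid_curve = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, train_pred)]
        t, lm, rm = fit_stump(xs, residuals)
        # Apply the same scaled stump to both sets of predictions.
        train_pred = [p + learning_rate * (lm if x <= t else rm)
                      for p, x in zip(train_pred, xs)]
        valid_pred = [p + learning_rate * (lm if x <= t else rm)
                      for p, x in zip(valid_pred, vx)]
        train_curve.append(mse(ys, train_pred))
        valid_curve.append(mse(vy, valid_pred))
    # Number of trees (1-based) with the lowest validation error.
    best_n = min(range(n_trees), key=valid_curve.__getitem__) + 1
    return best_n, train_curve, valid_curve

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = random.Random(0)
points = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.3)) for i in range(60)]
rng.shuffle(points)
train = ([x for x, _ in points[:40]], [y for _, y in points[:40]])
valid = ([x for x, _ in points[40:]], [y for _, y in points[40:]])

best_n, train_curve, valid_curve = boost_with_validation(train, valid)
print("best number of trees:", best_n)
print("final training MSE:", train_curve[-1])
print("validation MSE at best round:", valid_curve[best_n - 1])
```

Training error keeps falling as trees are added, while validation error may flatten or rise once the model starts fitting noise. Picking the round with the lowest validation error is a simple form of early stopping.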
Key idea
Gradient boosting adds shallow trees in sequence, each fit to the current errors, to steadily reduce bias.