Boosting as gradient descent in function space
Gradient boosting builds an ensemble one tree at a time. Each new tree is fit to the negative gradient of the loss with respect to the current predictions, which for squared error is just the residual. Adding trees is gradient descent, but the steps are functions, not weights.
The sequential recipe
- Start with a constant prediction, such as the mean.
- Compute the gradient of the loss at the current predictions, the pseudo residuals.
- Fit a small tree to those pseudo residuals and add a shrunken version to the model.
The key hyperparameters
- The learning rate shrinks each tree contribution. Smaller rates need more trees but generalize better.
- Tree depth is kept shallow, often three to six, so each tree is a weak learner.
- Subsampling rows and features adds randomness that fights overfitting, called stochastic gradient boosting.
Why it is powerful
By targeting the gradient of any differentiable loss, boosting handles regression, classification, and ranking with the same engine. The shrinkage and shallow trees make it accurate but slow to train and sensitive to its hyperparameters.
Key idea
Gradient boosting fits each new tree to the negative gradient of the loss, performing gradient descent in function space. A small learning rate with many shallow trees yields high accuracy on any differentiable loss.