The Gradient Boosting Deep

Building a strong model by fitting each new tree to the gradient of the loss so far.

Boosting as gradient descent in function space

Gradient boosting builds an ensemble one tree at a time. Each new tree is fit to the negative gradient of the loss with respect to the current predictions, which for squared error is just the residual. Adding trees is gradient descent, but the steps are functions, not weights.

The sequential recipe

Start with a constant prediction, such as the mean.
Compute the gradient of the loss at the current predictions, the pseudo residuals.
Fit a small tree to those pseudo residuals and add a shrunken version to the model.

The key hyperparameters

The learning rate shrinks each tree contribution. Smaller rates need more trees but generalize better.
Tree depth is kept shallow, often three to six, so each tree is a weak learner.
Subsampling rows and features adds randomness that fights overfitting, called stochastic gradient boosting.

Why it is powerful

By targeting the gradient of any differentiable loss, boosting handles regression, classification, and ranking with the same engine. The shrinkage and shallow trees make it accurate but slow to train and sensitive to its hyperparameters.

Key idea

Gradient boosting fits each new tree to the negative gradient of the loss, performing gradient descent in function space. A small learning rate with many shallow trees yields high accuracy on any differentiable loss.

The Gradient Boosting Deep

Boosting as gradient descent in function space

The sequential recipe

The key hyperparameters

Why it is powerful

Key idea

Check yourself