← Lessons

quiz vs the machine

Gold1450

Machine Learning

The Gradient Boosting Deep

Building a strong model by fitting each new tree to the gradient of the loss so far.

5 min read · core · beat Gold to climb

Boosting as gradient descent in function space

Gradient boosting builds an ensemble one tree at a time. Each new tree is fit to the negative gradient of the loss with respect to the current predictions, which for squared error is just the residual. Adding trees is gradient descent, but the steps are functions, not weights.

The sequential recipe

  • Start with a constant prediction, such as the mean.
  • Compute the gradient of the loss at the current predictions, the pseudo residuals.
  • Fit a small tree to those pseudo residuals and add a shrunken version to the model.

The key hyperparameters

  • The learning rate shrinks each tree contribution. Smaller rates need more trees but generalize better.
  • Tree depth is kept shallow, often three to six, so each tree is a weak learner.
  • Subsampling rows and features adds randomness that fights overfitting, called stochastic gradient boosting.

Why it is powerful

By targeting the gradient of any differentiable loss, boosting handles regression, classification, and ranking with the same engine. The shrinkage and shallow trees make it accurate but slow to train and sensitive to its hyperparameters.

Key idea

Gradient boosting fits each new tree to the negative gradient of the loss, performing gradient descent in function space. A small learning rate with many shallow trees yields high accuracy on any differentiable loss.

Check yourself

Answer to earn rating on the learn ladder.

1. What does each new tree in gradient boosting fit?

2. Why use a small learning rate in gradient boosting?

3. What makes gradient boosting work for many task types?