The Learning Rate in Boosting
In gradient boosting, each new tree's predictions are multiplied by a learning rate before being added to the ensemble. This number, also called shrinkage, controls how much any single tree can change the model.
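One common way to write this update is shown here. The notation is an assumption introduced for illustration: F_m is the ensemble after m trees, h_m is the m-th tree, and ν is the learning rate.

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1
```

Setting ν = 1 adds each tree at full strength; smaller values shrink every tree's contribution by the same factor.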
Why shrink each tree
If every tree were added at full strength, the model would fit the training data in a few large steps, lean heavily on its earliest trees, and overfit quickly. A small learning rate makes each step cautious, so the ensemble improves gradually and no single tree dominates the fit.
- A large rate means fewer trees but a higher risk of overfitting.
- A small rate means more trees are needed but generalization usually improves.
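The effect of the rate can be tried out with a minimal sketch. It assumes squared-error loss, one-dimensional inputs, and depth-one trees (stumps). The synthetic data and every function name (fit_stump, boost, predict, make_data) are hypothetical and written for this illustration, not taken from any library.

```python
import math
import random

def fit_stump(xs, residuals):
    """Least-squares stump on 1-D inputs: returns (threshold, left_value, right_value)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    mean = total / n
    # Fallback when no split is possible: predict the mean everywhere.
    best_gain, best = -1.0, (math.inf, mean, mean)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += residuals[order[k - 1]]
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # cannot split between equal inputs
        left_mean = left_sum / k
        right_mean = (total - left_sum) / (n - k)
        # Maximizing this quantity is equivalent to minimizing the split's squared error.
        gain = k * left_mean ** 2 + (n - k) * right_mean ** 2
        if gain > best_gain:
            best_gain, best = gain, (xs[order[k - 1]], left_mean, right_mean)
    return best

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(xs, ys, learning_rate, n_trees):
    """Return (base, stumps, train_mse_per_round) for squared-error boosting."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrinkage: only a fraction of the new tree's output is added.
        preds = [p + learning_rate * stump_predict(stump, x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history

def predict(base, stumps, learning_rate, x):
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

rng = random.Random(0)  # fixed seed so runs are repeatable

def make_data(n):
    # Assumed toy problem: a noisy sine wave.
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = make_data(80)
test_x, test_y = make_data(200)

for rate in (1.0, 0.1):
    base, stumps, history = boost(train_x, train_y, rate, n_trees=100)
    held_out = mse(test_y, [predict(base, stumps, rate, x) for x in test_x])
    print(f"rate={rate}: train MSE={history[-1]:.4f}, held-out MSE={held_out:.4f}")
```

Comparing the two printed lines shows how the same number of trees behaves at full strength and at a tenth of it. The exact numbers depend on the toy data, so they say little about any real problem.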
The trade with tree count
Learning rate and tree count trade off against each other: the smaller the rate, the more trees are needed. Halving the rate roughly doubles the trees needed to reach the same training fit. The common recipe is to set a small rate, such as 0.1 or lower, and add many trees while watching validation error.
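The doubling rule can be checked with some back-of-the-envelope arithmetic. The sketch below rests on a strong simplifying assumption that real boosting does not satisfy exactly: each tree removes a fraction of the remaining training error equal to the learning rate, so after n trees a fraction (1 - rate)^n remains. The function name trees_needed is hypothetical.

```python
import math

def trees_needed(learning_rate, target_fraction):
    """Smallest n with (1 - learning_rate) ** n <= target_fraction (idealized model)."""
    return math.ceil(math.log(target_fraction) / math.log(1.0 - learning_rate))

# Trees needed to shrink the remaining training error to 1% of its starting value.
for rate in (0.2, 0.1, 0.05, 0.025):
    print(rate, trees_needed(rate, 0.01))
```

Under this model the counts come out as 21, 44, 90, and 182, so each halving of the rate slightly more than doubles the trees, and the ratio approaches 2 as the rate gets smaller.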
Early stopping
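Validation error typically falls as trees are added, reaches a minimum, and then rises once the model starts to overfit. Early stopping monitors a held-out validation set during training and stops when the error has not improved for a set number of rounds, keeping the tree count that gave the lowest error. This finds a good tree count automatically instead of guessing.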
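The stopping rule itself is small enough to sketch on its own. The helper below is hypothetical and takes one validation error per added tree; the "patience" parameter (how many rounds without improvement to tolerate) is an assumed name, and the U-shaped curve is made up purely to exercise the logic.

```python
def early_stopping_round(validation_errors, patience):
    """Return (best_round, rounds_trained) for a stream of per-round validation errors.

    Rounds are 1-based: round n means the ensemble holds n trees.
    """
    best_error = float("inf")
    best_round = 0
    rounds_trained = 0
    for round_number, error in enumerate(validation_errors, start=1):
        rounds_trained = round_number
        if error < best_error:
            best_error = error
            best_round = round_number
        elif round_number - best_round >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_round, rounds_trained

# Made-up validation curve with its minimum at 30 trees (illustration only).
curve = ((n - 30) ** 2 / 900 + 0.2 for n in range(1, 201))
best_round, rounds_trained = early_stopping_round(curve, patience=5)
print(best_round, rounds_trained)  # prints "30 35"
```

Training runs five rounds past the minimum before giving up, then keeps the 30-tree model. A larger patience costs extra rounds but is less likely to stop on a temporary bump in a noisy curve.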
Key idea
A small learning rate shrinks each tree's contribution, so boosting takes many cautious steps that usually generalize better.