Speed on large datasets
LightGBM is a gradient boosting library built for large datasets. It pairs histogram-based split finding with three ideas that cut work without much loss of accuracy: leaf-wise tree growth, gradient-based row sampling, and feature bundling.
Leaf-wise tree growth
Many boosters grow trees level by level. LightGBM grows them leaf-wise, always splitting the leaf that offers the largest loss reduction. For the same number of leaves this usually reaches a lower loss, but the resulting trees can become deep and unbalanced and can overfit, so a limit on the number of leaves (num_leaves) is essential.
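The best-first rule above can be illustrated with a toy sketch. It is not LightGBM's implementation: the function grow_leaf_wise and the child_gains callback are hypothetical, and the callback stands in for the real work of finding the best split of each new leaf.

```python
import heapq

def grow_leaf_wise(root_gain, child_gains, max_leaves):
    """Best-first growth: repeatedly split the leaf with the largest gain.

    child_gains(gain) -> (left_gain, right_gain) is a hypothetical stand-in
    for evaluating the best split of each newly created leaf.
    """
    # heapq is a min-heap, so gains are stored negated; the id breaks ties.
    heap = [(-root_gain, 0)]
    next_id = 1
    leaves = 1
    total_reduction = 0.0
    while heap and leaves < max_leaves:
        neg_gain, _ = heapq.heappop(heap)
        gain = -neg_gain
        if gain <= 0:
            break  # no remaining split reduces the loss
        total_reduction += gain
        for g in child_gains(gain):
            heapq.heappush(heap, (-g, next_id))
            next_id += 1
        leaves += 1  # one leaf is replaced by two
    return leaves, total_reduction

# Toy gains: the left child keeps half of the parent's gain, the right a tenth.
result = grow_leaf_wise(8.0, lambda g: (g * 0.5, g * 0.1), max_leaves=4)
print(result)  # (4, 14.0)
```

In this toy setup, level-wise growth with the same four-leaf budget would split the root and then both of its children, for a total reduction of 8 + 4 + 0.8 = 12.8, while the leaf-wise order skips the weak 0.8 split and reaches 14.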
Two signature techniques
- Gradient-based One-Side Sampling (GOSS) keeps all rows with large gradients and randomly samples the rows with small gradients, focusing effort where the error is high. The sampled small-gradient rows are up-weighted so that split gain estimates stay close to those computed on the full data.
- Exclusive Feature Bundling (EFB) merges sparse features that rarely take nonzero values in the same row into a single feature, shrinking the effective feature count.
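Both techniques can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: the function names goss_sample and bundle_features are hypothetical, the rate arguments are named after LightGBM's top_rate and other_rate settings, and the bundling step only groups column indices (the real method also orders features and offsets their bin values so that bundled features stay distinguishable).

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by a GOSS-style sample."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))
    # Up-weight the sampled small-gradient rows so their total contribution
    # roughly matches that of the full small-gradient set.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights

def bundle_features(columns, max_conflicts=0):
    """Greedily group columns whose nonzero rows overlap at most max_conflicts times."""
    bundles = []  # each entry: (list of column indices, set of nonzero rows)
    for j, col in enumerate(columns):
        nonzero = {i for i, v in enumerate(col) if v != 0}
        for members, rows in bundles:
            if len(rows & nonzero) <= max_conflicts:
                members.append(j)
                rows |= nonzero  # in-place update of the bundle's row set
                break
        else:
            bundles.append(([j], set(nonzero)))
    return [members for members, _ in bundles]

# Columns 0, 1 and 3 are never nonzero in the same row; column 2 clashes with column 0.
cols = [[1, 0, 0, 0], [0, 2, 0, 0], [3, 0, 0, 4], [0, 0, 5, 0]]
print(bundle_features(cols))  # [[0, 1, 3], [2]]
```

With ten rows and the default rates, goss_sample keeps the two rows with the largest absolute gradients at weight 1 and one randomly chosen remaining row at a weight of about 8, so three rows stand in for ten.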
Tuning notes
- Control complexity mainly with num_leaves rather than max_depth, since growth is leaf-wise.
- Increase min_data_in_leaf to counter the overfitting that leaf-wise growth invites.
- LightGBM handles categorical features natively, without one-hot encoding.
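As a rough illustration of these notes, the settings could be collected in a parameter dictionary like the one below. The keys are LightGBM parameter names, but every value is a hypothetical starting point rather than a recommendation, and the helper check_leaf_budget is an illustrative addition that only checks that num_leaves fits inside a tree of the chosen depth.

```python
# Hypothetical starting values; tune them on a validation set.
params = {
    "objective": "binary",
    "num_leaves": 31,         # main complexity control for leaf-wise trees
    "max_depth": 8,           # optional cap; a depth-d tree has at most 2 ** d leaves
    "min_data_in_leaf": 100,  # larger values make each new leaf harder to create
    "learning_rate": 0.05,
}

def check_leaf_budget(p):
    """Return True if num_leaves fits within a tree of depth max_depth."""
    depth = p.get("max_depth", -1)  # values <= 0 mean no depth limit
    return depth <= 0 or p["num_leaves"] <= 2 ** depth

print(check_leaf_budget(params))  # True
```

A dictionary like this would typically be passed to lightgbm.train together with a lightgbm.Dataset, and categorical columns can be declared through that dataset's categorical_feature argument instead of being one-hot encoded.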
Key idea
LightGBM grows trees leaf-wise for lower loss per leaf, and speeds up training with GOSS sampling and exclusive feature bundling. Control overfitting through num_leaves and min_data_in_leaf.