Boosting built around categories
CatBoost is a gradient boosting library designed to handle categorical features natively and to avoid a subtle form of target leakage that affects naive target encoding and standard gradient boosting. Its two signature ideas are ordered target statistics and ordered boosting.
The target leakage problem
A common trick, called target encoding, replaces each category with the mean target value for that category. Done naively, each row's encoding includes its own label, which leaks the target into the feature and inflates training accuracy relative to performance on unseen data. CatBoost instead computes these statistics using only the rows that precede the current row in a random permutation, smoothed with a prior, so no row sees its own label.
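A minimal sketch of the idea in plain Python may help. It is not CatBoost's implementation: the function name, the prior_weight parameter, and the exact smoothing formula are assumptions chosen for illustration. It walks the rows in a random order and encodes each row from running per-category sums that do not yet include that row.

```python
import random
from collections import defaultdict


def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row using only rows that precede it in a random permutation.

    Illustrative only: the name, arguments, and smoothing formula are
    assumptions for this sketch, not CatBoost's API.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of row indices

    sums = defaultdict(float)  # running sum of targets per category
    counts = defaultdict(int)  # running number of rows per category
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        # Only earlier rows have been accumulated, so targets[i] is excluded.
        encoded[i] = (sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
        # Now make this row visible to the rows that come after it.
        sums[c] += targets[i]
        counts[c] += 1
    return encoded


cats = ["a", "a", "b", "a"]
ys = [1, 0, 1, 1]
print(ordered_target_stats(cats, ys, prior=0.5))
```

The first row of each category in the permutation has no history, so it receives the prior itself. In this sketch, changing a row's own label never changes that row's encoding, which is the property that removes the leak.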
Ordered boosting
The same leakage can occur when computing residuals: in standard boosting, the model that produces a row's residual was itself trained on that row. Ordered boosting maintains models trained on prefixes of a permutation, so each row gets its residual from a model that never saw it. This reduces the prediction shift, a bias that standard boosting suffers from.
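The contrast can be shown with a deliberately tiny toy, in which the "model" is just the mean of the targets it was fitted on. Everything here (function names, the mean-as-model simplification, the initial_prediction argument) is an assumption for illustration; the real algorithm fits trees and is considerably more involved.

```python
import random


def naive_residuals(targets):
    """Standard approach: one model (here, the global mean) fitted on all rows,
    so each row's prediction already depends on its own target."""
    mean = sum(targets) / len(targets)
    return [y - mean for y in targets]


def ordered_residuals(targets, seed=0, initial_prediction=0.0):
    """Toy ordered variant: each row's prediction comes from a model
    (here, a running mean) fitted only on rows earlier in a permutation."""
    n = len(targets)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of row indices

    residuals = [0.0] * n
    running_sum = 0.0
    seen = 0
    for i in order:
        # Prefix model: mean of the targets seen so far, excluding row i.
        prediction = running_sum / seen if seen else initial_prediction
        residuals[i] = targets[i] - prediction
        running_sum += targets[i]
        seen += 1
    return residuals


ys = [3.0, 1.0, 4.0, 1.0, 5.0]
print(naive_residuals(ys))
print(ordered_residuals(ys))
```

In the naive version, a row's prediction moves whenever its own target changes; in the ordered version it does not, because the prefix model never saw that row.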
Other traits
- It builds symmetric (also called oblivious) trees, in which every node at a given level uses the same split, making inference very fast.
- It automatically builds and encodes combinations of categorical features.
- Strong defaults mean it often works well with little tuning.
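To see why symmetric trees are fast at inference, here is a small sketch under assumed data structures: a tree is a list of (feature_index, threshold) pairs, one per level, plus a flat list of leaf values. These names and this layout are hypothetical, not CatBoost's internal format. Because every level shares one split, a row's leaf is found by turning the per-level comparisons into the bits of an index, with no branching through child pointers.

```python
def oblivious_tree_predict(row, splits, leaf_values):
    """Predict with a symmetric (oblivious) tree.

    splits: list of (feature_index, threshold), one shared split per level.
    leaf_values: list of 2 ** len(splits) leaf outputs.
    Both layouts are assumptions made for this sketch.
    """
    index = 0
    for feature_index, threshold in splits:
        # Each level contributes one bit: 1 if the row goes right, else 0.
        bit = int(row[feature_index] > threshold)
        index = (index << 1) | bit
    return leaf_values[index]


splits = [(0, 0.5), (1, 2.0)]            # depth-2 tree, four leaves
leaf_values = [10.0, 20.0, 30.0, 40.0]
print(oblivious_tree_predict([0.7, 1.0], splits, leaf_values))  # bits 1,0 -> leaf 2 -> 30.0
print(oblivious_tree_predict([0.1, 3.0], splits, leaf_values))  # bits 0,1 -> leaf 1 -> 20.0
```

A tree of depth d needs exactly d comparisons and one table lookup per row, and the comparisons are the same for every row, which makes them easy to batch.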
Key idea
CatBoost prevents target leakage with ordered target statistics computed from prior rows and with ordered boosting for residuals, while symmetric trees and native categorical handling give fast, robust models.