Beyond simple averaging
Ensembles can improve accuracy by combining several models, and the simplest combinations average the models' predictions or take a majority vote. Stacking, short for stacked generalization, goes further: it trains a second model, called the meta-learner, to learn how to combine the base models.
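As a point of reference, the two simple combinations can be sketched in a few lines of plain Python. The probabilities are made-up illustrative numbers, not results from real models, and the 0.5 voting threshold is an assumption.

```python
from collections import Counter
from statistics import mean

# Hypothetical class-1 probabilities from three models on four samples.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],  # model A
    [0.8, 0.6, 0.1, 0.4],  # model B
    [0.7, 0.7, 0.3, 0.3],  # model C
]

# Averaging: mean probability per sample across the models.
avg_probs = [mean(column) for column in zip(*model_probs)]

# Voting: each model casts a hard 0/1 vote at a 0.5 threshold,
# and the most common vote per sample wins.
votes = [[int(p >= 0.5) for p in probs] for probs in model_probs]
majority = [Counter(column).most_common(1)[0][0] for column in zip(*votes)]

print(majority)  # [1, 1, 0, 0]
```

Both rules treat every model the same on every sample, which is the limitation stacking tries to remove.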
How stacking works
The base models should be diverse, for example a tree ensemble, a linear model, and a neural network, so that their errors differ. Stacking then proceeds in three steps:
- Each base model makes predictions
- A meta-model takes those predictions as its input features
- The meta-model learns which base models to trust in which situations
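The three steps above can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration, not a tuned setup: the data is synthetic, the hyperparameters are arbitrary, and a k-nearest-neighbors model stands in for the neural network only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Diverse base models: a tree ensemble, a linear model, and an
# instance-based model (assumed stand-in for a neural network).
base_models = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("linear", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The meta-model receives the base models' predictions as features.
# cv=5 makes scikit-learn build those features from cross-validated
# predictions when fitting the meta-model.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
# One row of base-model predictions per test sample.
meta_features = stack.transform(X_test)
```

Inspecting stack.final_estimator_.coef_ after fitting gives a rough view of how much weight the meta-model places on each base model.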
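Because a base model is naturally good at predicting data it trained on, its predictions on that data look more accurate than they will be on new data. Stacking therefore trains the meta-learner on out-of-fold predictions generated through cross-validation: each training example is predicted by a copy of the base model that never saw it during fitting, so the meta-learner sees honest estimates rather than memorized answers.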
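The difference between in-sample and out-of-fold predictions can be made concrete with a short sketch. It assumes scikit-learn and NumPy, uses synthetic data with deliberately noisy labels, and picks an unpruned decision tree because it memorizes its training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, purely for illustration.
X, y = make_classification(
    n_samples=300, n_features=8, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)  # unpruned, so it memorizes

# In-sample: the tree predicts the same rows it was fitted on.
in_sample = tree.fit(X, y).predict_proba(X)[:, 1]

# Out-of-fold: each row is predicted by a clone fitted on the other folds.
oof = cross_val_predict(tree, X, y, cv=5, method="predict_proba")[:, 1]

in_sample_acc = np.mean((in_sample >= 0.5) == y)
oof_acc = np.mean((oof >= 0.5) == y)

# Meta-features: one out-of-fold column per base model.
base_models = [
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# A simple meta-model; LogisticRegression is L2-regularized by default.
meta_model = LogisticRegression().fit(meta_X, y)
```

Here in_sample_acc comes out perfect because the tree has memorized the noisy labels, while oof_acc is lower and closer to what new data would show. A meta-model fitted on the in-sample column would learn to trust the tree far more than it deserves.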
When to use it
Stacking often wins competitions by squeezing out the last bit of accuracy, but it adds complexity and the risk of overfitting at the meta layer. Keeping the meta-model simple, for example a regularized linear model, usually works best.
Key idea
Stacking trains a meta-model on the out-of-fold predictions of diverse base models, learning a smarter combination than plain averaging.