Trees overfit by growing too deep
A decision tree can keep splitting until every leaf is pure, memorizing the training data. Pruning controls this complexity so the tree generalizes.
Two pruning styles
- Pre-pruning stops growth early using limits such as maximum depth, minimum samples per leaf, or a minimum impurity decrease required for a split.
- Post-pruning grows a full tree, then collapses branches that do not improve performance on held-out data.
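The pre-pruning limits above can be sketched as stopping checks inside a tiny tree builder. This is a minimal illustration, not a library implementation: it assumes a single numeric feature, Gini impurity, and an unweighted impurity decrease, and the names grow, best_split, and the sample points are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points, min_samples_leaf):
    """Return (threshold, impurity decrease) for the best allowed split, or None."""
    parent = gini([y for _, y in points])
    best = None
    # Candidate thresholds: every distinct feature value except the largest.
    for t in sorted({x for x, _ in points})[:-1]:
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        # Pre-pruning limit: each child must keep enough samples.
        if min(len(left), len(right)) < min_samples_leaf:
            continue
        children = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or parent - children > best[1]:
            best = (t, parent - children)
    return best

def grow(points, depth=0, max_depth=None, min_samples_leaf=1, min_impurity_decrease=0.0):
    """Grow a tree of nested dicts, stopping early when a limit is hit."""
    labels = [y for _, y in points]
    leaf = {"leaf": True, "label": Counter(labels).most_common(1)[0][0]}
    # Stop if the node is pure or the depth limit is reached.
    if gini(labels) == 0.0 or (max_depth is not None and depth >= max_depth):
        return leaf
    split = best_split(points, min_samples_leaf)
    # Stop if no split is allowed or the best one gains too little.
    if split is None or split[1] < min_impurity_decrease:
        return leaf
    t = split[0]
    limits = (max_depth, min_samples_leaf, min_impurity_decrease)
    return {
        "leaf": False,
        "threshold": t,
        "left": grow([p for p in points if p[0] <= t], depth + 1, *limits),
        "right": grow([p for p in points if p[0] > t], depth + 1, *limits),
    }

def predict(node, x):
    """Follow thresholds down to a leaf and return its label."""
    while not node["leaf"]:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["label"]

def count_leaves(node):
    if node["leaf"]:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Hypothetical noisy data: the label mostly flips from 0 to 1 as x grows.
points = list(zip(range(1, 11), [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]))

full = grow(points)                # no limits: splits until every leaf is pure
stump = grow(points, max_depth=1)  # depth limit: at most one split
print(count_leaves(full), count_leaves(stump))
```

With no limits the tree fits every training point, including the noisy ones, while max_depth=1 leaves a single split. Post-pruning is not shown here; it would start from the full tree and remove branches afterwards.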
Cost-complexity pruning
The most common post-pruning method penalizes tree size: it scores each subtree by its training error plus alpha times its number of leaves. The tuning parameter alpha controls the tradeoff between fit and size. At alpha = 0 the full tree scores best; raising alpha removes more branches, producing a sequence of nested subtrees from which cross-validation picks the best.
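As a rough sketch of how the nested sequence arises, the code below applies weakest-link pruning to a small hand-built tree. It assumes the penalized cost R_alpha(T) = R(T) + alpha * leaves(T), with R measured as a count of training misclassifications. The tree shape and every error count are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import copy

def leaf(err):
    """A leaf; err is its number of training misclassifications."""
    return {"err": err, "children": []}

def node(err, left, right):
    """An internal node; err is the error it would have if collapsed to a leaf."""
    return {"err": err, "children": [left, right]}

def n_leaves(t):
    return 1 if not t["children"] else sum(n_leaves(c) for c in t["children"])

def subtree_err(t):
    """Total error over the leaves below t."""
    return t["err"] if not t["children"] else sum(subtree_err(c) for c in t["children"])

def cost(t, alpha):
    """Penalized cost: training error plus alpha times the number of leaves."""
    return subtree_err(t) + alpha * n_leaves(t)

def internal_nodes(t):
    if t["children"]:
        yield t
        for c in t["children"]:
            yield from internal_nodes(c)

def effective_alpha(t):
    """Alpha at which collapsing t costs exactly as much as keeping its subtree."""
    return (t["err"] - subtree_err(t)) / (n_leaves(t) - 1)

def pruning_path(tree):
    """Repeatedly collapse the weakest internal node; return (alpha, leaves) pairs."""
    tree = copy.deepcopy(tree)  # leave the caller's tree untouched
    path = [(0.0, n_leaves(tree))]
    while tree["children"]:
        weakest = min(internal_nodes(tree), key=effective_alpha)
        alpha = effective_alpha(weakest)
        weakest["children"] = []  # collapse the branch into a leaf
        path.append((alpha, n_leaves(tree)))
    return path

# Hypothetical five-leaf tree with made-up error counts.
tree = node(12,
            node(2, leaf(1), leaf(0)),
            node(5, leaf(1), node(3, leaf(1), leaf(0))))

for alpha, leaves in pruning_path(tree):
    print(alpha, leaves)
```

Each entry on the path is the smallest alpha at which the next branch stops paying for its extra leaves. Cross-validation would then compare the candidate alphas on held-out folds; that step is omitted here.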
Why prune
- A pruned tree has lower variance and is easier to read.
- Pre-pruning is cheaper but can stop too early, missing a good later split.
- Post-pruning is usually more reliable because it judges each branch by its measured gain on validation data, after the full tree has been grown.
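Judging branches by validation gain can be illustrated with reduced-error pruning, a simpler post-pruning scheme than cost-complexity pruning. The sketch below assumes a single numeric feature. The overgrown tree, its thresholds, and the validation points are all hypothetical.

```python
def leaf(label):
    return {"label": label}

def split(threshold, label, left, right):
    # label is the majority training class here, used if the node is collapsed.
    return {"threshold": threshold, "label": label, "left": left, "right": right}

def predict(node, x):
    while "threshold" in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["label"]

def errors(node, data):
    """Number of (x, y) pairs in data that the node's subtree misclassifies."""
    return sum(predict(node, x) != y for x, y in data)

def reduced_error_prune(node, val):
    """Bottom-up: collapse a branch when a leaf does at least as well on val."""
    if "threshold" not in node:
        return node
    t = node["threshold"]
    # Route validation points to the child they would reach.
    left_val = [(x, y) for x, y in val if x <= t]
    right_val = [(x, y) for x, y in val if x > t]
    pruned = split(t, node["label"],
                   reduced_error_prune(node["left"], left_val),
                   reduced_error_prune(node["right"], right_val))
    as_leaf = leaf(node["label"])
    # Ties favor the smaller tree.
    if errors(as_leaf, val) <= errors(pruned, val):
        return as_leaf
    return pruned

# Hypothetical overgrown tree: the inner split at 3.5 fits training noise.
overgrown = split(5, 0,
                  split(3.5, 0, leaf(0), leaf(1)),
                  leaf(1))

# Hypothetical held-out data following the rule "x <= 5 is class 0".
validation = [(1, 0), (2, 0), (4, 0), (4.5, 0), (6, 1), (7, 1), (8, 1)]

pruned_tree = reduced_error_prune(overgrown, validation)
print(pruned_tree)
```

The inner split is removed because it hurts on the validation points that reach it, while the root split survives because collapsing it would add errors.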
Key idea
Pre-pruning halts growth early with depth and sample limits, while post-pruning grows a full tree and then trims weak branches via cost-complexity pruning, tuning alpha by cross-validation to lower variance.