Gini Impurity and Entropy
To pick splits, a decision tree needs a number for how mixed the class labels in a node are. Gini impurity and entropy are the two standard impurity measures, and both reach their minimum of zero when a node holds a single class.
Gini impurity
Gini equals one minus the sum of squared class proportions. If a node is all one class, that class has proportion one and its square is one, so Gini is zero. A perfectly even two-class split gives a Gini of one half, the maximum for two classes. With K classes, the maximum is 1 - 1/K, reached when every class is equally common.
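The definition above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the function name gini_impurity is hypothetical, and returning zero for an empty node is an assumption made here for convenience.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 minus the sum of squared class proportions in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (even two-class split)
```

The two printed values match the pure-node and even-split cases described above.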
Entropy
Entropy sums, over the classes, each class proportion times the negative log of that proportion. It is also zero for a pure node and peaks when classes are evenly mixed. With base-2 logarithms, an even two-class split has an entropy of 1 bit. The reduction in entropy from a split is called information gain.
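As a rough sketch, entropy can be computed from class counts as shown here. The function name entropy is hypothetical, the base-2 logarithm (giving a result in bits) is an assumption since the text does not fix a base, and an empty node is again treated as pure.

```python
import math
from collections import Counter


def entropy(labels):
    """Return the sum over classes of p * log2(1 / p), in bits."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    counts = Counter(labels)
    # Only classes that actually occur are counted, so p is never zero here.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(entropy(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(entropy(["a", "a", "b", "b"]))  # 1.0 (even two-class split)
```

Writing the term as p * log2(1 / p) is the same as p times the negative log of p; it just keeps the pure-node result from printing as negative zero.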
How they compare
- Both reward purity and punish mixed nodes.
- Gini is slightly cheaper to compute because it avoids logarithms.
- They usually pick the same or very similar splits, so the choice rarely changes the final tree much.
Using the measure
For each candidate split, the tree computes the average impurity of the children, weighting each child by its share of the parent's samples, and subtracts that average from the parent impurity. The split with the biggest drop wins.
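The selection step can be sketched as follows for a single numeric feature. Everything here is illustrative: the helper names (gini, impurity_drop, best_threshold), the "value <= threshold" split rule, and the toy data are assumptions, and real tree libraries use more efficient search strategies.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_drop(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted average child impurity."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def best_threshold(values, labels, impurity=gini):
    """Try each distinct value as a threshold; return (threshold, drop)."""
    best = (None, 0.0)
    # Skip the largest value: it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        drop = impurity_drop(labels, left, right, impurity)
        if drop > best[1]:
            best = (t, drop)
    return best


values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(best_threshold(values, labels))  # (2.0, 0.5)
```

On this toy data the threshold 2.0 separates the classes perfectly, so both children are pure and the drop equals the full parent impurity of 0.5. Passing an entropy function as the impurity argument would turn the same drop into information gain.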
Key idea
Gini and entropy both score label mixing, are zero for pure nodes, and usually agree on the best split.