The problem
In a multiclass problem, every class gets its own precision, recall, and F1. To report a single number you must average them, and there are three common schemes.
The three averages
- Macro computes the metric per class, then takes a plain mean. Every class counts equally regardless of size.
- Micro pools all true positives, false positives, and false negatives across classes, then computes the metric once. Large classes dominate.
- Weighted computes the metric per class, then averages with each class weighted by its support (its number of true instances). It is often treated as a compromise between the two.
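The three schemes can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `f1` and `averaged_f1`, the example counts, and the convention of returning 0.0 for an undefined F1 are all assumptions made for this sketch.

```python
def f1(tp, fp, fn):
    """F1 from raw counts. Returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(counts):
    """counts maps each class label to a (tp, fp, fn) tuple."""
    per_class = {c: f1(tp, fp, fn) for c, (tp, fp, fn) in counts.items()}
    # Support is the number of true instances of a class: tp + fn.
    support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
    total = sum(support.values())

    # Macro: plain mean of the per-class scores.
    macro = sum(per_class.values()) / len(per_class)

    # Micro: pool the counts across classes, then compute F1 once.
    tp_sum = sum(tp for tp, _, _ in counts.values())
    fp_sum = sum(fp for _, fp, _ in counts.values())
    fn_sum = sum(fn for _, _, fn in counts.values())
    micro = f1(tp_sum, fp_sum, fn_sum)

    # Weighted: mean of per-class scores, weighted by support.
    weighted = sum(per_class[c] * support[c] for c in counts) / total

    return {"macro": macro, "micro": micro, "weighted": weighted}


# Hypothetical per-class counts for a 3-class, single-label problem.
counts = {
    "A": (45, 5, 5),    # support 50
    "B": (20, 13, 10),  # support 30
    "C": (10, 7, 10),   # support 20
}
scores = averaged_f1(counts)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, macro comes out near 0.692, micro is 0.750, and weighted is near 0.749. The weaker scores on the two smaller classes pull macro down, while micro and weighted are dominated by class A.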
When they differ
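On imbalanced data, macro and micro can diverge sharply.
- Macro highlights poor performance on rare classes, since a tiny class counts as much as a huge one.
- Micro reflects overall instance-level performance. In single-label multiclass classification, micro precision, recall, and F1 all equal accuracy.
- Weighted is biased toward frequent classes, so it usually lands closer to micro than to macro. It is not guaranteed to fall between them: weighted recall equals accuracy in the single-label case, and weighted F1 can exceed micro F1 when the largest class scores well.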
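The divergence is easy to reproduce on a small, made-up imbalanced dataset. The sketch below is self-contained plain Python; the function name `f1_averages`, the labels, and the predictions are hypothetical and exist only for this illustration, and an undefined F1 is assumed to count as 0.0.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Macro, micro, and weighted F1 plus accuracy for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class was missed
            fp[p] += 1  # the predicted class was wrongly claimed

    def f1(c_tp, c_fp, c_fn):
        denom = 2 * c_tp + c_fp + c_fn
        return 2 * c_tp / denom if denom else 0.0  # assumed convention

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    n = len(y_true)
    return {
        "macro": sum(per_class.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_class[c] * support[c] for c in labels) / n,
        "accuracy": sum(tp.values()) / n,
    }


# Hypothetical data: 90 "common" instances and 10 "rare" ones.
# The model gets 88 of the common class right but only 2 of the rare class.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 88 + ["rare"] * 2 + ["common"] * 8 + ["rare"] * 2

result = f1_averages(y_true, y_pred)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

On this data, micro F1 and accuracy are both 0.900, weighted F1 is about 0.880, and macro F1 drops to about 0.616 because the rare class (F1 near 0.286) counts as much as the common one.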
Choosing
If rare classes matter, report macro. If overall per-instance accuracy is what matters, micro is fine. Always state which one you used, because the same model can look very different under each.
Key idea
Macro treats classes equally, micro treats instances equally, and weighted is a support-weighted blend of the per-class scores. Name your averaging scheme or the number is ambiguous.