Macro, Micro, and Weighted Averaging

Three ways to roll per-class scores into one number, each telling a different story.

The problem

With many classes, you get a precision, recall, and F1 score for each class. To report one number, you must average them. There are three common schemes.

The three averages

Macro computes the metric per class, then takes a plain mean. Every class counts equally, regardless of size.
Micro pools all true positives, false positives, and false negatives across classes, then computes the metric once. Large classes dominate.
Weighted averages the per-class metrics, weighting each by class support (the number of true instances of that class). It is a compromise between the two.
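
The three definitions above can be sketched in plain Python for F1. This is a minimal illustration rather than a reference implementation: the function names (per_class_counts, f1, averaged_f1) are made up for this example, labels are assumed to be strings, one prediction per instance is assumed, and a class with no true positives, false positives, or false negatives is given an F1 of 0 by convention.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    return labels, tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred):
    labels, tp, fp, fn = per_class_counts(y_true, y_pred)
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true instances per class

    # Macro: plain mean of per-class scores.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts across classes, then compute once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-class scores weighted by support.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}
```

Calling averaged_f1(y_true, y_pred) on two equal-length lists of labels returns all three averages in one dictionary, which makes it easy to report them side by side.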

When they differ

On imbalanced data, macro and micro can diverge sharply.

Macro highlights poor performance on rare classes, since a tiny class counts as much as a huge one.
Micro reflects overall instance-level accuracy. In single-label multiclass problems, micro precision, recall, and F1 all equal accuracy.
Weighted typically sits between the two, biased toward frequent classes. For recall it coincides with micro exactly, since support-weighted recall is just overall accuracy.
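
A small worked case may make the divergence concrete. The data here is invented for illustration: 95 instances of a class called "common", 5 of a class called "rare", and a hypothetical model that always predicts "common". The helper f1_from_counts and the zero-when-undefined convention are assumptions of this sketch.

```python
# Invented imbalanced dataset: 95 common, 5 rare.
y_true = ["common"] * 95 + ["rare"] * 5
# Hypothetical degenerate model: always predicts the majority class.
y_pred = ["common"] * 100

labels = ["common", "rare"]


def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


counts = {}
for c in labels:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    counts[c] = (tp, fp, fn)

per_class = {c: f1_from_counts(*counts[c]) for c in labels}
support = {c: y_true.count(c) for c in labels}

scores = {
    # Plain mean: the failed rare class drags the score down to about 0.487.
    "macro": sum(per_class.values()) / len(labels),
    # Pooled counts: 95 TP, 5 FP, 5 FN give 0.95.
    "micro": f1_from_counts(*(sum(col) for col in zip(*counts.values()))),
    # Support-weighted mean: about 0.926.
    "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(scores, accuracy)
```

The model never finds a single rare instance, yet micro F1 matches the 0.95 accuracy. Only the macro score, at roughly 0.49, shows that one of the two classes is being missed entirely.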

Choosing

If rare classes are important, report macro. If overall instance-level accuracy is what matters, micro is fine. Always state which one you used, because the same model can look very different under each.

Key idea