Why combine two metrics
Precision and recall each tell only half the story, and they trade off against each other. Reporting both is honest but awkward when you need to rank models or tune a threshold. The F1 score folds them into a single value.
The harmonic mean
F1 is the harmonic mean of precision and recall, not the ordinary average. The harmonic mean is dominated by the smaller of the two numbers:
- If precision is high but recall is near zero, F1 stays near zero
- F1 is only high when both precision and recall are high
- This punishes models that game one metric while ignoring the other
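The behaviour in the list above can be checked numerically. The sketch below is a minimal illustration, not a library implementation: the function names and the precision/recall pairs are made up for the example, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero (an assumed convention
    that avoids dividing by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision: float, recall: float) -> float:
    """Ordinary average, shown only for comparison."""
    return (precision + recall) / 2


# Illustrative (made-up) precision/recall pairs.
for p, r in [(0.90, 0.90), (0.95, 0.05), (0.50, 0.50)]:
    print(
        f"P={p:.2f} R={r:.2f}  "
        f"arithmetic={arithmetic_mean(p, r):.3f}  F1={f1_score(p, r):.3f}"
    )
```

For the lopsided pair (0.95, 0.05) the arithmetic mean is 0.5, while F1 is pulled down close to the smaller value.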
Variants
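The general F-beta score weights recall more heavily when beta is above one, which is useful when missing positives is costly, and weights precision more heavily when beta is below one; beta equal to one gives F1. On imbalanced data, F1 is far more informative than plain accuracy, because a model that always predicts the majority class can score high accuracy while its F1 on the minority (positive) class is zero.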
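As a rough sketch of both points, the code below uses the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), and then compares accuracy with F1 on a made-up imbalanced example. The function names, the example counts, and the zero-division convention are assumptions chosen for illustration.

```python
def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; beta=1 reduces to F1.

    Returns 0.0 when the denominator is zero (assumed convention).
    """
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# A hypothetical model with good precision but weak recall.
p, r = 0.8, 0.4
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {fbeta_score(p, r, beta):.3f}")

# Made-up imbalanced data: 990 negatives, 10 positives,
# and a model that predicts "negative" for every example.
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy={accuracy:.2f}  F1={f1_from_counts(tp, fp, fn):.2f}")
```

Because recall is the weaker metric in this example, the score falls as beta rises: the beta = 2 score is lower than F1, and the beta = 0.5 score is higher. The always-negative model reaches 0.99 accuracy yet has an F1 of zero on the positive class.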
Key idea
The F1 score is the harmonic mean of precision and recall, rewarding models only when both are strong, with the F-beta score tilting the balance toward one or the other.