The labeling budget problem
Labels cost money and time, so labeling everything is wasteful. Active learning lets the model choose which unlabeled examples would teach it the most, so each label buys more accuracy.
The loop
- Train a model on the current labeled set.
- Run it over the unlabeled pool and score how informative each example would be.
- Send the top-scoring examples to human labelers.
- Add the new labels and retrain.
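The four steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not a reference implementation: the one-dimensional threshold "model", the `oracle` function standing in for human labelers, the true boundary at 0.6, and all function names are hypothetical and chosen only to keep the example self-contained.

```python
import random

def train_threshold(labeled):
    """Toy 1-D model: midpoint between the largest 0 and the smallest 1 seen so far."""
    max_zero = max(x for x, y in labeled if y == 0)
    min_one = min(x for x, y in labeled if y == 1)
    return (max_zero + min_one) / 2.0

def informativeness(threshold, x):
    """Score is higher for points closer to the current decision threshold."""
    return -abs(x - threshold)

def oracle(x):
    """Stand-in for a human labeler (assumption: the true boundary is 0.6)."""
    return int(x >= 0.6)

def active_learning_loop(labeled, pool, rounds=5, batch_size=2):
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        threshold = train_threshold(labeled)                # train on current labels
        pool.sort(key=lambda x: informativeness(threshold, x),
                  reverse=True)                             # score the unlabeled pool
        batch, pool = pool[:batch_size], pool[batch_size:]  # take the top-scoring examples
        labeled += [(x, oracle(x)) for x in batch]          # label them and add to the set
    return train_threshold(labeled), labeled, pool          # final retrain

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
seed = [(0.0, 0), (1.0, 1)]  # one labeled example per class to start
threshold, labeled, remaining = active_learning_loop(seed, pool)
print(round(threshold, 3), len(labeled), len(remaining))
```

In this toy setting each round narrows the gap between the largest known 0 and the smallest known 1, so the loop behaves much like a binary search for the boundary.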
Choosing what to label
- Uncertainty sampling picks the examples the model is least confident about, for example those whose predicted probability is close to 0.5 in a binary classifier, which places them near the decision boundary.
- Diversity sampling avoids labeling many near-duplicates by spreading picks across the data.
- Good systems blend both, since the most uncertain points often cluster together and labeling several of them adds little beyond labeling one.
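One way to blend the two criteria is a greedy pick that adds a distance bonus to the uncertainty score. The sketch below assumes a binary classifier that outputs a probability per example and feature vectors on a comparable scale. The names `uncertainty`, `select_batch`, and `diversity_weight` are hypothetical, and the additive scoring rule is just one simple choice among many.

```python
import math

def uncertainty(p):
    """Binary uncertainty: 1.0 at p = 0.5, falling to 0.0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p - 0.5)

def select_batch(points, probs, k, diversity_weight=0.5):
    """Greedily pick k indices, trading uncertainty against distance to earlier picks."""
    chosen = []
    candidates = set(range(len(points)))
    while candidates and len(chosen) < k:
        def score(i):
            if not chosen:
                return uncertainty(probs[i])
            # Distance to the nearest already-chosen point rewards spread-out picks.
            nearest = min(math.dist(points[i], points[j]) for j in chosen)
            return uncertainty(probs[i]) + diversity_weight * nearest
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Three near-duplicates close to the boundary, plus one distant, less uncertain point.
points = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.0), (5.0, 5.0)]
probs = [0.50, 0.52, 0.47, 0.60]

print(select_batch(points, probs, 2, diversity_weight=0.0))  # uncertainty only
print(select_batch(points, probs, 2))                        # blended
```

With `diversity_weight=0.0` both picks come from the cluster of near-duplicates. With the default weight the second pick moves to the distant point, even though the model is somewhat more confident about it.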
Why it helps
- The model tends to learn most from examples it is unsure about, which is often where its current errors concentrate, so targeting those can reach a given accuracy with fewer labels than random sampling.
A caution
- Active learning can over-focus on a narrow region and ignore easy but important areas, so periodic random sampling keeps the labeled set representative.
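A simple way to apply this caution is to reserve part of every batch for uniformly random picks. The sketch below assumes each pool example already has an informativeness score. The function name `select_with_exploration` and the 20% default for `random_fraction` are illustrative choices, not recommended values.

```python
import random

def select_with_exploration(scores, k, random_fraction=0.2, rng=None):
    """Pick k pool indices: most by score, the remainder uniformly at random."""
    rng = rng or random.Random()
    k = min(k, len(scores))
    n_random = int(round(k * random_fraction))
    n_top = k - n_random
    # Rank pool indices from most to least informative.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n_top]
    # Draw the random share from everything not already selected by score.
    explore = rng.sample(ranked[n_top:], n_random)
    return top, explore

scores = [i / 100 for i in range(100)]  # hypothetical informativeness scores
top, explore = select_with_exploration(scores, k=10, rng=random.Random(0))
print(top, explore)
```

Returning the two groups separately makes it possible to track how the random picks are labeled, which gives a rough check on whether the score-driven picks are drifting away from the overall data distribution.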
Key idea
Active learning closes a loop in which the model selects the most informative unlabeled examples to label, often reaching a target accuracy with fewer labels than random sampling.