The idea
Active learning reduces labeling cost by letting the model choose which examples to label next. Instead of labeling data at random, you label the examples the model is expected to learn the most from, so you can often reach a given accuracy with far fewer labels.
The loop
Active learning runs as a cycle between the model and a human labeler.
- Train the model on the small set of labels you have so far
- Use the model to score a large pool of unlabeled data
- Select the most informative examples by a query strategy
- Send those to a human to label, add the new labels to the training set, and repeat
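The four steps above can be sketched as a small pool-based loop. This is a minimal illustration, not a reference implementation: the 1-D threshold "model", the `oracle` function standing in for the human labeler, and the boundary value 0.6 are all assumptions made up for the example.

```python
TRUE_BOUNDARY = 0.6  # hypothetical ground truth, known only to the oracle


def oracle(x):
    """Stand-in for the human labeler: returns 1 if x is above the true boundary."""
    return int(x > TRUE_BOUNDARY)


def fit_threshold(labeled):
    """Toy model: put the decision threshold halfway between the largest
    labeled negative and the smallest labeled positive."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2


def active_learning_loop(pool, seed_labeled, rounds, batch_size):
    labeled = list(seed_labeled)
    pool = list(pool)
    for _ in range(rounds):
        if not pool:
            break
        # 1. Train on the labels collected so far.
        threshold = fit_threshold(labeled)
        # 2. Score the pool: points closest to the threshold are the most uncertain.
        ranked = sorted(pool, key=lambda x: abs(x - threshold))
        # 3. Select the most informative examples.
        queries = ranked[:batch_size]
        # 4. Ask the labeler, add the new labels, and repeat.
        for x in queries:
            labeled.append((x, oracle(x)))
            pool.remove(x)
    return fit_threshold(labeled), labeled


if __name__ == "__main__":
    pool = [i / 1000 for i in range(1, 1000)]
    seed = [(0.0, 0), (1.0, 1)]
    threshold, labeled = active_learning_loop(pool, seed, rounds=10, batch_size=1)
    print(f"estimated threshold: {threshold:.4f} from {len(labeled)} labels")
```

In this toy setting the loop behaves much like a binary search over the pool, which is why a handful of queries is enough to locate the boundary. Real models and data are rarely this clean.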
Query strategies
The query strategy defines what counts as most informative.
- Uncertainty sampling picks examples the model is least confident about
- Query by committee trains several models and picks examples they disagree on
- Diversity methods avoid picking many near-identical points in the same batch
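Each of these strategies can be reduced to a scoring or selection function. The sketch below shows one common choice per strategy; others exist. Least confidence and entropy stand in for uncertainty sampling, vote entropy for query by committee, and greedy farthest-point selection for diversity. The function names are illustrative, and the inputs (class probabilities, committee votes, feature vectors) are assumed to come from whatever model you use.

```python
import math
from collections import Counter


def least_confidence(probs):
    """Uncertainty: 1 minus the probability of the predicted class."""
    return 1.0 - max(probs)


def entropy(probs):
    """Uncertainty: Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes):
    """Committee disagreement: entropy of the labels voted by the committee members."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def farthest_point_selection(points, k):
    """Diversity: greedily pick the point farthest from everything chosen so far.
    Starts from index 0, which is an arbitrary choice for this sketch."""
    if not points or k <= 0:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(points)):
        candidates = (i for i in range(len(points)) if i not in chosen)
        best = max(
            candidates,
            key=lambda i: min(math.dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    print(least_confidence([0.5, 0.5]), least_confidence([0.9, 0.1]))
    print(vote_entropy(["cat", "dog", "cat"]))
    print(farthest_point_selection([(0, 0), (0.1, 0), (5, 5), (0, 0.1)], 2))
```

Higher scores mean more informative for the first three functions. The diversity function returns indices rather than scores, because whether a point is worth picking depends on which points are already in the batch.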
When it helps and its risks
Active learning shines when unlabeled data is plentiful but labeling is expensive, as with medical images. A risk is sampling bias: the labeled set contains only the points the query strategy chose, so it can drift away from the true data distribution and skew the model. Mixing in some random samples helps keep the data representative.
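One way to mix in random samples is to reserve a fraction of every batch for uniformly random picks. The sketch below assumes a list of informativeness scores, one per pool example. The function name `mixed_batch` and the 20% default are illustrative choices, not recommended values.

```python
import random


def mixed_batch(scores, batch_size, random_fraction=0.2, rng=None):
    """Return pool indices: mostly top-scoring examples, plus a random share
    drawn from the rest of the pool. random_fraction is assumed to be in [0, 1]."""
    rng = rng or random.Random()
    batch_size = min(batch_size, len(scores))
    n_random = round(batch_size * random_fraction)
    n_informative = batch_size - n_random

    # Rank pool indices from most to least informative.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    informative = ranked[:n_informative]

    # Draw the random share from examples the strategy did not pick.
    random_part = rng.sample(ranked[n_informative:], n_random)
    return informative + random_part


if __name__ == "__main__":
    scores = [i / 100 for i in range(100)]  # made-up scores for illustration
    print(mixed_batch(scores, batch_size=10, rng=random.Random(0)))
```

The random share costs some labeling efficiency in exchange for a labeled set that stays closer to the pool's distribution. How large it should be depends on the task.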
Key idea
Active learning queries the most informative unlabeled examples for labeling, reaching high accuracy with fewer labels.