Beyond a fixed split
A classic A/B test holds a fixed split until it ends, so with a 50/50 split half the traffic goes to a possibly worse model the whole time. A multi-armed bandit instead adapts, routing more traffic to whichever model looks best so far while still exploring the others.
The explore-exploit tradeoff
- Exploit, send traffic to the current best model to maximize reward now.
- Explore, keep sampling other models so a true winner is not missed by chance.
Bandits balance these automatically as data accumulates.
Common strategies
- Epsilon-greedy, pick the best arm most of the time and a random arm with small probability epsilon.
- Thompson sampling, sample from each arm's reward posterior and pick the arm with the highest sample, naturally balancing exploration.
- Upper confidence bound, pick the arm whose estimated mean plus uncertainty bonus is highest, so both strong and under-sampled arms get traffic.
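The three strategies above can be sketched as arm-selection functions. This is a minimal illustration, not a production router. It assumes binary rewards (for example, click or no click), a uniform Beta(1, 1) prior for Thompson sampling, and the UCB1 form of the confidence bonus. The function names and the `successes` / `pulls` counters are hypothetical.

```python
import math
import random


def epsilon_greedy(successes, pulls, epsilon=0.1, rng=random):
    """With probability epsilon pick a random arm, else the best observed mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(pulls))
    # Arms never pulled are scored 0.0 here (a simplifying assumption).
    means = [s / n if n else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=means.__getitem__)


def thompson(successes, pulls, rng=random):
    """Draw once from each arm's Beta posterior and pick the largest draw."""
    draws = [
        rng.betavariate(1 + s, 1 + (n - s))  # Beta(1 + wins, 1 + losses)
        for s, n in zip(successes, pulls)
    ]
    return max(range(len(pulls)), key=draws.__getitem__)


def ucb1(successes, pulls):
    """Pick the arm with the highest mean plus uncertainty bonus (UCB1)."""
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm  # pull every arm once before comparing bounds
    total = sum(pulls)
    scores = [
        s / n + math.sqrt(2 * math.log(total) / n)
        for s, n in zip(successes, pulls)
    ]
    return max(range(len(pulls)), key=scores.__getitem__)
```

Each function takes per-arm success and pull counts and returns the index of the arm to serve next. In this sketch the caller is responsible for updating the counts after observing the reward.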
Tradeoffs versus A/B testing
- Bandits cut regret, the reward lost to inferior arms, by shifting traffic early.
- But adaptive allocation complicates clean statistical inference about the exact effect size.
Use bandits when maximizing live reward matters more than a precise measurement.
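To make regret concrete, the sketch below simulates two arms and compares a fixed alternating split with Thompson sampling. The reward rates (0.10 and 0.30), the round count, and all function names are illustrative assumptions, not measurements. Regret is computed as the sum over rounds of the gap between the best arm's true rate and the chosen arm's true rate.

```python
import random


def simulate_regret(rates, rounds, choose, seed=0):
    """Total expected regret: sum of (best rate - chosen arm's rate) per round."""
    rng = random.Random(seed)
    successes = [0] * len(rates)
    pulls = [0] * len(rates)
    best = max(rates)
    regret = 0.0
    for t in range(rounds):
        arm = choose(t, successes, pulls, rng)
        reward = 1 if rng.random() < rates[arm] else 0  # simulated binary reward
        successes[arm] += reward
        pulls[arm] += 1
        regret += best - rates[arm]
    return regret


def fixed_split(t, successes, pulls, rng):
    # A/B-style allocation: alternate arms regardless of observed results.
    return t % len(pulls)


def thompson(t, successes, pulls, rng):
    # Beta(1, 1) prior; sample each posterior and pick the largest draw.
    draws = [rng.betavariate(1 + s, 1 + (n - s)) for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=draws.__getitem__)


RATES = [0.10, 0.30]  # hypothetical true reward rates, unknown to the policies
ab_regret = simulate_regret(RATES, 2000, fixed_split)
bandit_regret = simulate_regret(RATES, 2000, thompson)
```

Under these assumptions the fixed split sends 1000 of the 2000 rounds to the weaker arm, so its regret is about 1000 × 0.20 = 200. The bandit's regret depends on the random draws, but it should be much smaller because traffic shifts to the better arm early. The resulting unequal sample sizes are also what makes the effect-size estimate harder to analyze.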
Key idea
A multi-armed bandit adaptively routes traffic toward the best-performing model while still exploring the others, reducing regret at the cost of the cleaner statistical inference a fixed A/B test provides.