The Multi-Armed Bandit Deployment

Shifting traffic toward the winning model as evidence arrives, instead of waiting for a test to finish.

Beyond a fixed split

A classic A/B test holds a fixed split until it ends, so with a 50/50 split half the traffic may go to a worse model the whole time. A multi-armed bandit adapts instead, routing more traffic to whichever model looks best so far while still exploring the others.

The explore-exploit tradeoff

Exploit, send traffic to the current best to maximize reward now.
Explore, keep sampling other models so a true winner is not missed by chance.

Bandits balance these automatically as data accumulates.

Common strategies

Epsilon-greedy, pick the best arm so far most of the time and a random arm with small probability epsilon.
Thompson sampling, sample from each arm's reward posterior and pick the arm with the highest draw, which naturally balances exploration and exploitation.
Upper confidence bound, pick the arm with the highest estimated mean plus an uncertainty bonus, so rarely tried arms still get traffic.
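
The three strategies above can be sketched as arm-selection functions. This is a minimal illustration, not a production router. It assumes binary rewards (for example, a click or no click), a uniform Beta(1, 1) prior for Thompson sampling, and the UCB1 form of the confidence bonus. The function names and the successes/pulls bookkeeping are hypothetical.

```python
import math
import random


def epsilon_greedy(successes, pulls, epsilon=0.1, rng=random):
    """With probability epsilon pick a random arm, else the best observed mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(pulls))
    # Arms never tried are scored 0.0 here; exploration reaches them at random.
    means = [s / n if n else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=means.__getitem__)


def thompson(successes, pulls, rng=random):
    """Draw from each arm's Beta posterior and pick the arm with the largest draw."""
    # Assumed prior: Beta(1, 1), so the posterior is Beta(1 + wins, 1 + losses).
    draws = [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=draws.__getitem__)


def ucb1(successes, pulls):
    """Pick the arm with the highest mean plus uncertainty bonus (UCB1 variant)."""
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm  # try every arm once before comparing bounds
    total = sum(pulls)
    scores = [
        s / n + math.sqrt(2 * math.log(total) / n)
        for s, n in zip(successes, pulls)
    ]
    return max(range(len(pulls)), key=scores.__getitem__)
```

Each function takes per-arm success and pull counts and returns the index of the model that should serve the next request. The caller is expected to update the counts once the reward is observed.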

Tradeoffs versus A/B testing

Bandits cut regret, the reward lost to inferior arms, by shifting traffic early.
But adaptive allocation complicates clean statistical inference about the exact effect size.
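
To make the regret comparison concrete, here is a small simulation sketch. It assumes two models with made-up conversion rates of 5% and 10%, binary rewards, and Thompson sampling with a Beta(1, 1) prior as the bandit. All names and rates are hypothetical. Regret is counted as the expected reward given up each time the weaker arm is chosen.

```python
import random


def simulate(policy, rates, rounds, seed=0):
    """Run `policy` on Bernoulli arms; return (expected regret, pulls per arm)."""
    rng = random.Random(seed)
    successes = [0] * len(rates)
    pulls = [0] * len(rates)
    best = max(rates)
    regret = 0.0
    for t in range(rounds):
        arm = policy(t, successes, pulls, rng)
        reward = 1 if rng.random() < rates[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
        regret += best - rates[arm]  # expected reward lost on this request
    return regret, pulls


def fixed_split(t, successes, pulls, rng):
    """Classic A/B test: alternate arms regardless of the evidence."""
    return t % len(pulls)


def thompson(t, successes, pulls, rng):
    """Bandit: sample each arm's Beta posterior and serve the largest draw."""
    draws = [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=draws.__getitem__)


RATES = [0.05, 0.10]  # hypothetical true conversion rates
ab_regret, ab_pulls = simulate(fixed_split, RATES, rounds=10_000)
ts_regret, ts_pulls = simulate(thompson, RATES, rounds=10_000)
print("fixed split:", ab_pulls, round(ab_regret, 1))
print("thompson:   ", ts_pulls, round(ts_regret, 1))
```

By construction the fixed split sends 5,000 requests to the weaker arm and gives up 5,000 x 0.05 = 250 expected conversions. The bandit's figure varies with the seed but should be much smaller. The same run also shows the inference cost: the weaker arm ends up with few samples, so its rate is estimated less precisely than under an even split.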

Use bandits when maximizing live reward matters more than a precise measurement.

Key idea