The Multi Armed Bandit for Ranking

The bandit setup

Imagine a row of slot machines, each with an unknown payout. A multi armed bandit must decide which arm to pull to maximize total reward, balancing trying arms it knows little about against pulling the best known one. In ranking, each item or option is an arm, and a click is a reward.

Regret, the thing to minimize

Bandits are judged by regret: the reward lost compared to always pulling the truly best arm. Good algorithms drive regret to grow slowly over time as estimates sharpen.

Classic strategies

Epsilon greedy picks the best arm most of the time and a random arm a small fraction, simple but blunt.
Upper confidence bound picks the arm with the highest optimistic estimate, exploring arms whose uncertainty is large.
Thompson sampling keeps a probability distribution over each arm's value and samples from it, exploring in proportion to uncertainty.

Why bandits fit recsys

They handle the explore exploit tradeoff principled, not by guesswork.
They adapt as feedback arrives, ideal for fresh items and shifting tastes.
They reduce the impressions wasted compared to naive random exploration.

A caution

Plain bandits ignore context. Two users may value the same item differently, which motivates the contextual extension.

Key idea

A multi armed bandit chooses items to minimize regret, using strategies like upper confidence bound and Thompson sampling to balance exploration and exploitation as feedback arrives.