The exploration-exploitation tradeoff
An agent that always exploits its current best guess may never discover better options, while one that always explores never cashes in. Good exploration balances gathering information against earning reward.
Common strategies
- Epsilon-greedy acts greedily most of the time but picks a uniformly random action with small probability epsilon, which is often decayed over training. It is simple but undirected: the random choices ignore how uncertain each action's value is.
- Boltzmann (softmax) sampling chooses actions in proportion to the exponential of their values divided by a temperature, so it explores mostly among similarly good actions and rarely picks clearly bad ones.
- Optimism under uncertainty initializes value estimates high, so untried actions look attractive until experience lowers their estimates.
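The strategies above can be sketched in a few lines of Python. This is a minimal illustration for a tabular setting, not a reference implementation: the function names, the linear decay schedule, the default temperature, and the optimistic initial value are all assumptions made for the example.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linear decay from start to end over decay_steps (an assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over q / temperature; the max is subtracted for numerical stability."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_sample(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

def optimistic_q_values(n_actions, initial_value=1.0):
    """Optimistic initialization: every action starts at an assumed high value."""
    return [initial_value] * n_actions
```

With epsilon set to 0 the first function is purely greedy, and lowering the temperature in the Boltzmann functions concentrates probability on the highest-valued action. The optimistic initial value only helps if it is at least as large as any return the agent can actually obtain.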
Directed exploration
More sophisticated agents seek informative experiences rather than random ones:
- Upper confidence bound methods add an exploration bonus that shrinks as a state-action pair is visited more.
- Intrinsic motivation rewards novelty or prediction error, driving agents toward unfamiliar states in sparse-reward settings.
- Count-based bonuses reward rarely seen states, generalizing optimism to large spaces.
The right choice depends on how sparse rewards are and how large the state space is.
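As a rough sketch of two of the directed schemes, the code below shows a UCB-style action rule and a count-based reward bonus. The names `ucb_action` and `CountBonus`, the constants `c` and `beta`, and the inverse-square-root bonus are assumptions chosen for illustration. The counter is tabular and needs hashable states; in a large state space it would have to be replaced by some form of state hashing or density-based pseudo-count.

```python
import math
from collections import defaultdict

def ucb_action(q_values, counts, total_steps, c=2.0):
    """Pick the action maximizing q + c * sqrt(ln(t) / n).

    Actions that have never been tried (n == 0) are selected first,
    since their bonus is effectively unbounded.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        q + c * math.sqrt(math.log(total_steps) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])

class CountBonus:
    """Count-based bonus: beta / sqrt(N(s)), added to the environment reward."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state):
        # Record the visit first so the bonus is always finite.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

In both cases the bonus shrinks as visits accumulate, so exploration fades on its own where the agent already has data. Setting `c` or `beta` to 0 recovers purely greedy behavior.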
Key idea
Exploration strategies range from simple undirected methods like epsilon-greedy to directed schemes using confidence bonuses, optimism, and intrinsic novelty rewards. Which one fits depends on reward sparsity and state-space size.