Machine Learning

The Exploration Strategies Deep Dive

How agents balance trying new actions against exploiting known good ones.

6 min read · advanced

The exploration-exploitation tradeoff

An agent that always exploits its current best guess may never discover better options, while one that always explores never cashes in. Good exploration balances gathering information against earning reward.

Common strategies

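  • Epsilon-greedy acts greedily most of the time but picks a uniformly random action with small probability ε, which is often decayed over training. Simple but undirected: the random picks ignore how uncertain each value estimate is.
  • Boltzmann (softmax) sampling chooses each action with probability proportional to the exponential of its value, usually scaled by a temperature, so exploration concentrates on similarly good actions and clearly worse ones are rarely tried.
  • Optimism under uncertainty, in its simplest form, initializes value estimates high so that untried actions look attractive until experience lowers them.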
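
The three strategies above can be sketched for a small tabular setting. This is a minimal illustration, not a reference implementation: the function names, the `temperature` parameter, and the deterministic toy rewards are assumptions made for the example.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature):
    """Softmax over values. `temperature` is an assumed knob: higher means more uniform."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng):
    """Sample an action index according to the softmax probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def greedy_with_optimism(true_rewards, initial_value, steps):
    """Purely greedy agent on a toy bandit with deterministic rewards (hypothetical setup).

    Returns the final value estimates and how often each arm was pulled.
    """
    q = [initial_value] * len(true_rewards)
    counts = [0] * len(true_rewards)
    for _ in range(steps):
        a = max(range(len(q)), key=lambda i: q[i])  # ties go to the lowest index
        counts[a] += 1
        q[a] += (true_rewards[a] - q[a]) / counts[a]  # sample-average update
    return q, counts


if __name__ == "__main__":
    rng = random.Random(0)
    print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=rng))
    print(boltzmann_probs([1.0, 1.0, 0.0], temperature=0.5))
    # High initial values make the greedy agent try every arm once;
    # zero initial values leave it stuck on the first arm it tried.
    print(greedy_with_optimism([0.2, 0.8, 0.5], initial_value=5.0, steps=10)[1])
    print(greedy_with_optimism([0.2, 0.8, 0.5], initial_value=0.0, steps=10)[1])
```

In this toy setup, the optimistic agent samples each arm before settling on the best one, while the zero-initialized greedy agent never leaves the first arm. Real rewards are noisy, so the contrast is usually less clean than it is here.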

Directed exploration

More sophisticated agents seek informative experiences rather than random ones:

  • Upper confidence bound (UCB) methods add an exploration bonus to each value estimate; the bonus shrinks as a state-action pair is visited more often.
  • Intrinsic motivation rewards novelty or prediction error, driving agents toward unfamiliar states in sparse-reward settings.
  • Count-based bonuses reward rarely seen states, extending optimism to large state spaces, typically through approximate counts.
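
The bonus-based ideas above can be sketched as follows. This is a rough illustration under stated assumptions: the bonus form `c * sqrt(ln t / n)` is one common UCB variant, and `c`, `beta`, and the `CountBonus` class are hypothetical names chosen for the example. Exact tabular counts stand in for the approximate counts that large state spaces usually require.

```python
import math
from collections import defaultdict


def ucb_scores(q_values, counts, t, c=2.0):
    """Value estimate plus a bonus c * sqrt(ln t / n) that shrinks with visit count n.

    `t` is the total number of steps so far (t >= 1); `c` is an assumed scale factor.
    """
    scores = []
    for q, n in zip(q_values, counts):
        if n == 0:
            scores.append(math.inf)  # untried actions are selected first
        else:
            scores.append(q + c * math.sqrt(math.log(t) / n))
    return scores


def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action with the highest upper confidence score."""
    scores = ucb_scores(q_values, counts, t, c)
    return max(range(len(scores)), key=lambda a: scores[a])


class CountBonus:
    """Intrinsic reward beta / sqrt(N(s)) for hashable states (tabular simplification)."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.visits = defaultdict(int)

    def __call__(self, state):
        self.visits[state] += 1
        return self.beta / math.sqrt(self.visits[state])


if __name__ == "__main__":
    # Equal value estimates, but the rarely visited action gets the larger score.
    print(ucb_scores([0.5, 0.5], counts=[1, 100], t=101))
    bonus = CountBonus(beta=1.0)
    # The bonus for a state decays each time that state is seen again.
    print(bonus("s0"), bonus("s0"), bonus("s1"))
```

An agent would typically add the count bonus to the environment reward before its usual value update, so rarely visited states look temporarily more rewarding.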

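The right choice depends on how sparse the rewards are and how large the state space is: undirected methods are often adequate for small problems with dense rewards, while sparse rewards or large state spaces tend to call for directed exploration.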

Key idea

Exploration strategies range from simple undirected methods like epsilon-greedy to directed schemes using confidence bonuses, optimism, and intrinsic novelty rewards; which one fits depends on reward sparsity and state-space size.

Check yourself

1. What does epsilon-greedy do?

2. How do upper confidence bound methods explore?

3. When is intrinsic motivation especially useful?