Epsilon Greedy and Softmax
Knowing that you must explore is not enough; you need a concrete rule. Epsilon greedy and softmax are two of the simplest and most widely used exploration strategies.
Epsilon greedy
Epsilon greedy is blunt and effective:
- With probability epsilon, pick a random action.
- Otherwise, pick the action with the highest estimated value.
The parameter epsilon sets the exploration rate. It is often decayed over time so the agent explores early and exploits later. The weakness of epsilon greedy is that its random exploration treats all non-greedy actions equally, even clearly bad ones.
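The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the linear decay schedule, and its default values are all assumptions chosen for the example. It also assumes that the random branch samples from every action, the greedy one included.

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: every action (including the greedy one) is equally likely.
        return rng.randrange(len(values))
    # Exploit: index of the highest estimated value (first one wins on ties).
    return max(range(len(values)), key=lambda a: values[a])

def linear_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Hypothetical schedule: anneal epsilon linearly from start to end."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

# Example: choose an action at training step 200 under the decayed epsilon.
estimates = [0.1, 0.9, 0.3]
action = epsilon_greedy(estimates, linear_epsilon(200))
```

With epsilon set to 0 the function always returns the greedy action, and with epsilon set to 1 it is purely random. The schedule moves between those extremes as training proceeds.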
Softmax
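Softmax exploration, also known as Boltzmann exploration, passes the estimated values through a softmax and samples an action from the resulting distribution, so each action's probability is proportional to the exponential of its value rather than to the value itself. Higher-value actions are more likely, but every action keeps a nonzero chance.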
- A temperature parameter controls how sharp the distribution is.
- High temperature makes choices nearly uniform; low temperature makes them nearly greedy.
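As a rough sketch, softmax action selection gives action a the probability exp(Q(a) / T) / sum over b of exp(Q(b) / T), where Q is the estimated value and T is the temperature. The Python below is one way to write that, assuming the values arrive as a plain list; the function names are illustrative, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the probabilities.

```python
import math
import random

def softmax_probs(values, temperature):
    """Turn estimated action values into selection probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in values]
    # Shift by the maximum so exp() cannot overflow; the ratios are unchanged.
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(values, temperature, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]

# Example: the same estimates under a high and a low temperature.
estimates = [1.0, 2.0, 3.0]
nearly_uniform = softmax_probs(estimates, temperature=1000.0)
nearly_greedy = softmax_probs(estimates, temperature=0.01)
```

In this example the high temperature yields probabilities close to one third each, while the low temperature puts almost all of the mass on the last action.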
Softmax exploration is more directed than epsilon greedy: it concentrates on actions that look promising instead of picking uniformly at random.
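To make the difference concrete, the sketch below computes the selection probabilities that each rule assigns to the same three estimates, one of which is clearly bad. The values, the epsilon of 0.1, and the temperature of 0.5 are arbitrary choices for illustration, and the epsilon greedy probabilities assume the random branch samples uniformly over all actions.

```python
import math

values = [1.0, 0.9, -5.0]   # best, near-best, clearly bad (illustrative numbers)
epsilon = 0.1
temperature = 0.5
n = len(values)

# Epsilon greedy: epsilon is spread evenly, the rest goes to the greedy action.
greedy = max(range(n), key=lambda a: values[a])
eps_probs = [epsilon / n + (1.0 - epsilon if a == greedy else 0.0) for a in range(n)]

# Softmax: weights are exponentials of the temperature-scaled values.
top = max(values)
weights = [math.exp((v - top) / temperature) for v in values]
soft_probs = [w / sum(weights) for w in weights]

for a in range(n):
    print(f"action {a}: epsilon greedy {eps_probs[a]:.4f}, softmax {soft_probs[a]:.6f}")
```

Under these assumptions epsilon greedy gives the near-best and the clearly bad action the same small probability, whereas softmax gives the near-best action far more weight and the bad action far less.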
Choosing between them
Epsilon greedy is trivial to implement and has a single, easily interpreted parameter, which is why it is such a common default in practice. Softmax can explore more efficiently when action values differ widely, but its temperature must be tuned to the scale of the value estimates.
Key idea
Epsilon greedy explores by occasionally acting at random, while softmax explores by weighting actions by their estimated value, both trading off discovery against reward.