Monte Carlo Methods
Monte Carlo methods learn value functions directly from experience, with no model of the environment. They estimate the value of a state as the average return seen after visiting it.
Learning from returns
The idea is simple: play full episodes until they end, then for each state compute the actual return that followed and average those returns over many episodes.
- The estimate needs no transition probabilities, only sampled experience.
- Each sampled return is unbiased, since it is a real observed return rather than an estimate built on other estimates.
As more episodes accumulate, the average for each state that continues to be visited converges to its true expected return under the policy.
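The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The episode layout (a list of states paired with the reward received after leaving each one), the function names, and the discount factor `gamma` are all assumptions made for the example. With `gamma=1.0` the return is simply the total reward that followed. The toy episodes never revisit a state, so the question of how to count repeat visits does not arise here.

```python
from collections import defaultdict


def discounted_returns(rewards, gamma=1.0):
    """Return G_t for every time step, computed backward from the episode end."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * G_{t+1}, with G = 0 after termination.
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def average_returns(episodes, gamma=1.0):
    """Average, per state, the returns that followed it across all episodes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, rewards in episodes:
        for state, g in zip(states, discounted_returns(rewards, gamma)):
            totals[state] += g
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# Two hypothetical episodes; rewards[t] is the reward after leaving states[t].
episodes = [
    (["A", "B"], [1.0, 2.0]),
    (["A", "B"], [3.0, 0.0]),
]
values = average_returns(episodes)
print(values)
```

Note that nothing in the sketch refers to transition probabilities: the sampled episodes are the only input.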
Episodes required
Because Monte Carlo waits for the complete return, it applies only to episodic tasks, that is, tasks that terminate. No update is possible until the episode is over and the total reward is known, which makes the method unsuitable for never-ending (continuing) tasks.
First visit and every visit
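Two variants differ in how they count a state that appears more than once within an episode. First-visit Monte Carlo averages only the return that follows the first appearance of the state in each episode, while every-visit Monte Carlo averages the returns that follow all appearances. Both converge to the same value as the number of visits grows. With a finite number of episodes, the first-visit estimate is unbiased, whereas the every-visit estimate is biased, though the bias vanishes in the limit.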
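The difference between the two variants is easiest to see on an episode that revisits a state. The sketch below is illustrative only. The function name `mc_prediction`, the `first_visit` flag, the episode layout, and the discount factor `gamma` are assumptions made for the example.

```python
from collections import defaultdict


def mc_prediction(episodes, first_visit=True, gamma=1.0):
    """Estimate state values by averaging returns, first-visit or every-visit."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, rewards in episodes:
        # Backward pass: pair each time step with the return that followed it.
        g = 0.0
        visits = []
        for state, reward in zip(reversed(states), reversed(rewards)):
            g = reward + gamma * g
            visits.append((state, g))
        visits.reverse()  # restore chronological order

        seen = set()
        for state, g in visits:
            if first_visit and state in seen:
                continue  # first-visit: ignore later appearances in this episode
            seen.add(state)
            totals[state] += g
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# One hypothetical episode that visits A twice; every step yields reward 1.
episode = (["A", "B", "A"], [1.0, 1.0, 1.0])
first = mc_prediction([episode], first_visit=True)
every = mc_prediction([episode], first_visit=False)
print(first)  # A uses only the return from its first appearance
print(every)  # A averages the returns from both appearances
```

Here the returns following the three steps are 3, 2 and 1. First-visit assigns A the value 3, while every-visit assigns it (3 + 1) / 2 = 2. On a single episode the estimates differ; they agree only in the limit of many visits.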
Strengths and limits
Monte Carlo is conceptually clean and model-free, but it has high variance because each return depends on every random event in the rest of the episode, and it learns slowly because updates must wait until each episode ends.
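The variance claim can be checked with a small experiment. The sketch below assumes a deliberately artificial setting that is not from the text: every step of an episode yields a reward of +1 or -1 with equal probability, independently of everything else. The helper name `sample_return` and the sample sizes are also assumptions for illustration.

```python
import random
import statistics


def sample_return(length, rng):
    """Undiscounted return of one episode made of `length` random +1/-1 rewards."""
    return sum(rng.choice((-1.0, 1.0)) for _ in range(length))


rng = random.Random(0)  # fixed seed so the experiment is repeatable

short_returns = [sample_return(1, rng) for _ in range(2000)]
long_returns = [sample_return(50, rng) for _ in range(2000)]

short_var = statistics.pvariance(short_returns)
long_var = statistics.pvariance(long_returns)

print(f"variance of 1-step returns:  {short_var:.2f}")
print(f"variance of 50-step returns: {long_var:.2f}")
```

Because the rewards in this toy setting are independent, the variance of the return grows in proportion to the episode length, so the 50-step returns are far more spread out than the 1-step ones. More samples are then needed before the average settles.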
Key idea
Monte Carlo methods estimate values by averaging the actual returns observed after visiting states, learning from complete episodes without any model.