Monte Carlo Methods

Monte Carlo methods learn value functions directly from experience, with no model of the environment. They estimate the value of a state as the average return seen after visiting it.

Learning from returns

The idea is simple: play full episodes until they end, then for each state compute the actual return that followed and average those returns over many episodes.
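In symbols, the procedure above can be written as follows. This is a sketch in commonly used notation, not notation taken from the text: the discount factor \gamma is an assumption (set \gamma = 1 for the plain total reward), T is the time step at which the episode ends, R_{t+k+1} is the reward received k+1 steps after time t, and N(s) is the number of returns recorded for state s so far.

```latex
G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1},
\qquad
V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s)
```

Here G^{(i)}(s) denotes the i-th return observed after a visit to s.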

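The estimate needs no transition probabilities, only sampled experience. Each observed return is an unbiased sample of the expected return under the policy that generated it, because it is a real outcome rather than an estimate.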

As more episodes accumulate, the averages converge to the true expected return under the policy.
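The averaging does not require storing every return. A minimal sketch of a running average is shown here, assuming returns arrive one at a time; the function name, the state label "A", and the four return values are made up for illustration.

```python
from collections import defaultdict


def update_running_mean(values, counts, state, observed_return):
    """Fold one observed return into the running average for `state`."""
    counts[state] += 1
    # Incremental mean: V <- V + (G - V) / N
    values[state] += (observed_return - values[state]) / counts[state]


values = defaultdict(float)  # V(s), starts at 0.0
counts = defaultdict(int)    # N(s), number of returns averaged so far

# Hypothetical returns observed after visiting state "A" in four episodes.
for g in [4.0, 2.0, 6.0, 0.0]:
    update_running_mean(values, counts, "A", g)

print(values["A"])  # 3.0, the arithmetic mean of the four returns
```

The incremental form gives the same result as summing and dividing, but keeps only two numbers per state.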

Episodes required

Because Monte Carlo needs the complete return, it applies only to episodic tasks, meaning tasks that are guaranteed to terminate. No update is possible until the episode is over and the return is known, which makes the method unsuitable for never-ending (continuing) tasks.
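The dependence on a finished episode shows up directly in code. The sketch below computes the return for every step by working backward from the final reward, which is only possible once the last reward exists. The function name and the discount factor `gamma` are assumptions made for illustration; the text itself speaks only of total reward, which corresponds to `gamma=1.0`.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Return the list of returns G_t for one finished episode.

    rewards[t] is the reward received after step t.
    gamma is an assumed discount factor; 1.0 gives the plain total reward.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(returns_from_rewards([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```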

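First-visit and every-visit

Two variants differ in how they count a state that appears more than once in an episode. First-visit Monte Carlo averages only the return that follows the first appearance of the state, while every-visit Monte Carlo averages the returns that follow all appearances. Both converge to the same value as the number of visits grows, though only the first-visit estimate is unbiased after a finite number of episodes.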
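The difference between the two variants is easiest to see on an episode that revisits a state. The following is a minimal sketch, not a reference implementation: the episode format (a list of `(state, reward)` pairs), the function name, and the `gamma` parameter are all assumptions chosen for illustration.

```python
from collections import defaultdict


def mc_evaluate(episodes, first_visit=True, gamma=1.0):
    """Average the returns per state over a list of finished episodes.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Compute the return following each step, working backward.
        g = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            step_returns.append((state, g))
        step_returns.reverse()  # restore chronological order

        seen = set()
        for state, g in step_returns:
            if first_visit and state in seen:
                continue  # first-visit: ignore repeat appearances
            seen.add(state)
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Hypothetical episode in which state "A" appears twice.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
print(mc_evaluate([episode], first_visit=True))   # {'A': 3.0, 'B': 2.0}
print(mc_evaluate([episode], first_visit=False))  # {'A': 2.5, 'B': 2.0}
```

In this episode the returns following the two visits to "A" are 3.0 and 2.0, so first-visit records 3.0 while every-visit records their mean, 2.5.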

Strengths and limits

Monte Carlo is conceptually clean and model-free, but its estimates have high variance, because each return depends on every random action, transition, and reward until the end of the episode. It also learns slowly, since no update can be made before an episode ends.

Key idea

Monte Carlo methods estimate values by averaging the actual returns observed after visiting states, learning from complete episodes without any model.
