Machine Learning

The Markov Decision Process Deep Dive

The formal frame that turns sequential decision making into a solvable mathematical object.

The five ingredients

A Markov Decision Process (MDP) describes an agent acting in an environment over time. It has five parts:

  • A set of states S that the world can be in.
  • A set of actions A the agent may take.
  • A transition function that gives the probability of the next state given the current state and action.
  • A reward function that scores each transition.
  • A discount factor gamma between 0 and 1 that sets how much future rewards count relative to immediate ones; any value below 1 makes them count for less.
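
As a rough illustration, the five parts can be written down as a small data structure. This is a minimal sketch, not a standard library API: the class name MDP, the dictionary layout, the step helper, and the two-state "cool/hot" toy problem with its numbers are all assumptions made up for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    """Hypothetical container for the five ingredients of a finite MDP."""
    states: tuple        # S: every state the world can be in
    actions: tuple       # A: every action the agent may take
    transitions: dict    # (state, action) -> {next_state: probability}
    rewards: dict        # (state, action, next_state) -> reward
    gamma: float         # discount factor between 0 and 1

    def step(self, state, action, rng):
        """Sample one next state and its reward for a (state, action) pair."""
        dist = self.transitions[(state, action)]
        candidates = list(dist)
        weights = [dist[s] for s in candidates]
        next_state = rng.choices(candidates, weights=weights)[0]
        reward = self.rewards.get((state, action, next_state), 0.0)
        return next_state, reward


# Made-up toy problem: an engine that can run slow or fast.
toy = MDP(
    states=("cool", "hot"),
    actions=("slow", "fast"),
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    rewards={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
    gamma=0.9,
)

rng = random.Random(0)
print(toy.step("cool", "fast", rng))
```

Each entry in transitions is a probability distribution over next states, so its values should sum to 1. Note that step reads only the current state and action, which is the assumption the next section names.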

The Markov property

The defining assumption is that the next state depends only on the current state and action, not the full history. This memoryless property is what makes the MDP tractable. If the world truly needs history, you fold that history into the state itself.
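
Folding history into the state can be sketched with a toy process. The dynamics below are invented for illustration: the next reading depends on the last two readings plus the action, so a single reading is not a Markov state, but the pair (previous, current) is. The function names are hypothetical.

```python
def next_reading(prev, curr, action):
    """Made-up dynamics: the next reading depends on the last TWO readings."""
    return (prev + curr + action) % 10


def augmented_step(state, action):
    """Transition on the augmented state (prev, curr).

    The pair carries all the history the dynamics need, so the next
    augmented state depends only on the current one and the action.
    """
    prev, curr = state
    return (curr, next_reading(prev, curr, action))


# Two histories that both end at reading 3 diverge afterwards,
# so the reading alone does not determine what comes next.
a = augmented_step((1, 3), action=0)
b = augmented_step((2, 3), action=0)
print(a, b)  # (3, 4) (3, 5)
```

The cost of this trick is a larger state space: here it grows from 10 readings to 100 pairs.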

The goal

The agent wants a policy, a rule that maps states to actions, that maximizes the expected discounted return: the sum of rewards, each scaled by gamma raised to the power of its time step. With gamma below 1 and bounded rewards, discounting keeps infinite-horizon sums finite, and it also expresses a preference for sooner rewards.
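
The discounted return for one observed reward sequence can be computed in a few lines. This is a sketch under one stated assumption: time steps are counted from 0, so the first reward is not discounted. The function name is chosen for this example.

```python
def discounted_return(rewards, gamma):
    """Return the sum of gamma**t * rewards[t], with t starting at 0.

    Works backwards through the sequence so that each step only needs
    one multiplication: G_t = r_t + gamma * G_{t+1}.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# 1 + 0.5 * 1 + 0.25 * 1
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75

# A long run of constant reward 1 approaches 1 / (1 - gamma) = 10,
# which is why the infinite-horizon sum stays finite when gamma < 1.
print(discounted_return([1.0] * 1000, gamma=0.9))
```

The expected discounted return that a policy maximizes is the average of this quantity over the random trajectories the policy produces.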

Key idea

An MDP packages sequential decisions into states, actions, transitions, rewards, and a discount, and the Markov property makes finding an optimal policy a well-defined problem.

Check yourself

1. What does the Markov property assume?

2. Why is a discount factor gamma less than one useful?