The Markov Decision Process Deep Dive

The formal frame that turns sequential decision making into a solvable mathematical object.

The five ingredients

A Markov Decision Process (MDP) describes an agent acting in an environment over time. It has five parts:

A set of states S that the world can be in.
A set of actions A the agent may take.
A transition function P(s' | s, a) that gives the probability of reaching next state s' from current state s under action a.
A reward function R(s, a, s') that scores each transition.
A discount factor gamma between 0 and 1 that weights a reward received t steps ahead by gamma to the power t, so future rewards count for less than immediate ones.
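
As a rough illustration, the five parts can be written down as a small data structure. This is a minimal sketch, not a standard library API: the class name MDP, the helper is_valid, and the two-state "cool"/"hot" example with its probabilities and rewards are all made up for this illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: tuple        # S: every state the world can be in
    actions: tuple       # A: every action the agent may take
    transitions: dict    # (state, action) -> {next_state: probability}
    rewards: dict        # (state, action, next_state) -> reward
    gamma: float         # discount factor

    def step(self, state, action, rng):
        """Sample a next state and return it with the reward for that transition."""
        dist = self.transitions[(state, action)]
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights, k=1)[0]
        reward = self.rewards.get((state, action, next_state), 0.0)
        return next_state, reward


def is_valid(mdp):
    """Check that every (state, action) pair has a proper probability distribution."""
    for state in mdp.states:
        for action in mdp.actions:
            dist = mdp.transitions.get((state, action))
            if dist is None:
                return False
            if any(p < 0 or nxt not in mdp.states for nxt, p in dist.items()):
                return False
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return 0.0 <= mdp.gamma <= 1.0


# Hypothetical toy problem: a machine that can run "slow" or "fast".
toy = MDP(
    states=("cool", "hot"),
    actions=("slow", "fast"),
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    rewards={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
    gamma=0.9,
)
```

Calling toy.step("cool", "fast", random.Random(0)) samples one transition. Storing transitions as nested dictionaries only suits small, finite problems; larger ones would more likely use arrays or a simulator.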

The Markov property

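The defining assumption is that the next state depends only on the current state and action, not on the full history. This memoryless property is what makes the MDP tractable: a policy only has to look at the current state rather than at an ever-growing record of the past. If the world truly needs history, you fold that history into the state itself. For example, if the next position of an object depends on how fast it is moving, the state holds both position and velocity, or equivalently the last two positions.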
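
Folding history into the state can be sketched in a few lines. The example below is an assumption-laden toy, not part of any library: an object's next position depends on its last two positions, so a state holding only the current position would not be Markov. Making the state a pair (previous position, current position) restores the property. The function names and the deterministic dynamics are chosen purely for illustration.

```python
def next_position(prev_pos, pos, action):
    """Hypothetical dynamics: keep the current velocity, then add the action as a push."""
    velocity = pos - prev_pos
    return pos + velocity + action


def augmented_step(state, action):
    """The augmented state (prev_pos, pos) is enough to compute the next state."""
    prev_pos, pos = state
    return (pos, next_position(prev_pos, pos, action))


# Roll the system forward with no push; only the current state is ever consulted.
state = (0, 2)
trajectory = [state]
for _ in range(3):
    state = augmented_step(state, action=0)
    trajectory.append(state)

print(trajectory)  # [(0, 2), (2, 4), (4, 6), (6, 8)]
```

The cost of this trick is a larger state space: here every state is a pair, so the number of states is squared.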

The goal

The agent wants a policy, a rule mapping states to actions, that maximizes the expected discounted return: the sum of rewards, each scaled by gamma raised to its time step (r_0 + gamma r_1 + gamma^2 r_2 + ...). When gamma is strictly less than 1 and rewards are bounded, discounting keeps infinite-horizon sums finite, and it expresses a preference for sooner rewards.
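
The discounted return of a single reward sequence is easy to compute directly. The sketch below assumes a finite list of rewards indexed from time step 0; the function name discounted_return is chosen for this illustration. It works backwards through the list, which avoids computing powers of gamma explicitly.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Work from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75

# With a constant reward of 1 and gamma = 0.5, the return approaches
# 1 / (1 - gamma) = 2 as the sequence grows, but never exceeds it.
long_run = discounted_return([1.0] * 50, gamma=0.5)
print(long_run < 2.0)  # True
```

The second example illustrates why the sum stays finite: for rewards bounded by r_max and gamma below 1, the return is at most r_max / (1 - gamma).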

Key idea

An MDP packages sequential decisions into states, actions, transitions, rewards, and a discount, and the Markov property makes finding an optimal policy a well-defined problem.

