The five ingredients
A Markov Decision Process (MDP) describes an agent acting in an environment over time. It has five parts:
- A set of states S that the world can be in.
- A set of actions A the agent may take.
- A transition function that gives the probability of the next state given the current state and action.
- A reward function that scores each transition.
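- A discount factor gamma between 0 and 1 that makes future rewards count for less than immediate ones. A reward t steps ahead is weighted by gamma^t, so gamma = 1 applies no discounting at all.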
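To make the five parts concrete, here is a minimal sketch of an MDP as a plain Python container. The class name `MDP`, the `step` helper, and the two-state "cool"/"hot" example with its probabilities and rewards are all invented for illustration; they are assumptions, not part of any standard library.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    """Hypothetical container holding the five parts of an MDP."""
    states: tuple
    actions: tuple
    transitions: dict  # (state, action) -> {next_state: probability}
    rewards: dict      # (state, action, next_state) -> reward
    gamma: float

    def step(self, state, action, rng):
        """Sample a next state and return it with the transition's reward."""
        dist = self.transitions[(state, action)]
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights)[0]
        return next_state, self.rewards[(state, action, next_state)]


# Illustrative toy problem: an engine that is either "cool" or "hot".
toy = MDP(
    states=("cool", "hot"),
    actions=("slow", "fast"),
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    rewards={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
    gamma=0.9,
)

rng = random.Random(0)
next_state, reward = toy.step("cool", "fast", rng)
print(next_state, reward)
```

Storing the transition function as a dictionary of distributions only works for small, finite state and action sets. Larger problems would typically use a function or a simulator instead.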
The Markov property
The defining assumption is that the next state depends only on the current state and action, not on the full history. This memoryless property is what makes the MDP tractable: the agent can choose its action from the current state alone. If the dynamics truly depend on history, you fold the relevant part of that history into the state itself.
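One common way to fold history into the state is to treat the last few observations as a single state. The sketch below assumes that a window of `k` recent observations is enough to restore the Markov property; the function name `fold_history` is hypothetical.

```python
from collections import deque


def fold_history(observations, k):
    """Yield augmented states: tuples of the k most recent observations."""
    window = deque(maxlen=k)  # old observations fall off automatically
    for obs in observations:
        window.append(obs)
        yield tuple(window)


states = list(fold_history(["a", "b", "c"], k=2))
print(states)  # [('a',), ('a', 'b'), ('b', 'c')]
```

The cost of this trick is a larger state set: with n possible observations and a window of k, there can be on the order of n^k augmented states.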
The goal
The agent wants a policy, a rule mapping states to actions, that maximizes the expected discounted return. The return is the sum of rewards, each scaled by gamma raised to its time step. When gamma is strictly less than 1 and rewards are bounded, discounting keeps infinite-horizon sums finite, and it expresses a preference for sooner rewards.
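The discounted return for one finite sequence of rewards can be computed as in the sketch below. The function name `discounted_return` is illustrative, and the reward at time step t is assumed to be weighted by gamma^t with t starting at 0.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    # Work backwards so each step is one multiply and one add:
    # G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75

# With a constant reward of 1 and gamma < 1, the return approaches
# 1 / (1 - gamma) as the horizon grows, so it stays finite.
print(discounted_return([1] * 50, gamma=0.5))  # just under 2.0
```

The expected discounted return that the agent maximizes is the average of this quantity over the random trajectories a policy produces, not the value for a single trajectory.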
Key idea
An MDP packages sequential decisions into states, actions, transitions, rewards, and a discount, and the Markov property makes finding an optimal policy a well-defined problem.