The Markov Decision Process
A Markov decision process, or MDP, is the standard mathematical model for reinforcement learning problems. It describes an agent that interacts with an environment over a sequence of time steps: at each step the agent observes a state, chooses an action, and receives a reward.
The five pieces
An MDP is defined by five parts:
- States describe the situation the agent is in.
- Actions are the choices available in each state.
- Transitions give the probability of the next state given the current state and action.
- Rewards are the numeric feedback after each action.
- The discount factor, a number between 0 and 1, weighs future rewards against immediate ones.
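The five parts listed above map naturally onto a small data structure. The following is a minimal sketch, not a standard API: the `MDP` class, the `discounted_return` helper, and the two-state "cool"/"hot" example with its probabilities and rewards are all hypothetical, chosen only to illustrate the definition.

```python
from dataclasses import dataclass


# Hypothetical container for the five parts of an MDP (illustrative only).
@dataclass(frozen=True)
class MDP:
    states: tuple       # every situation the agent can be in
    actions: tuple      # every choice available to the agent
    transitions: dict   # (state, action) -> {next_state: probability}
    rewards: dict       # (state, action) -> numeric reward
    gamma: float        # discount factor, between 0 and 1


def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Made-up two-state example: an engine that is either cool or hot.
toy = MDP(
    states=("cool", "hot"),
    actions=("slow", "fast"),
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    rewards={
        ("cool", "slow"): 1.0,
        ("cool", "fast"): 2.0,
        ("hot", "slow"): 1.0,
        ("hot", "fast"): -10.0,
    },
    gamma=0.5,
)

# Each transition entry is a probability distribution, so it must sum to 1.
for dist in toy.transitions.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], toy.gamma))
```

With a discount factor of 0.5, each step into the future counts half as much as the one before it. A value closer to 1 would make the agent weigh distant rewards almost as heavily as immediate ones.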
The Markov property
The defining assumption is the Markov property: the next state and reward depend only on the current state and action, not on the full history. The present state captures everything relevant about the past.
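This matters because it lets the agent plan using only the current state. If the property does not hold, the state is usually redefined to include enough history to restore it. For example, a state that records only an object's position can be extended to include its velocity.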
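The Markov property can also be stated as an equation. The notation here is an assumption, since the text introduces no symbols: S_t, A_t, and R_t stand for the state, action, and reward at time step t, and s' and r are particular values of the next state and reward.

```latex
\[
\Pr\left(S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t\right)
=
\Pr\left(S_{t+1} = s',\, R_{t+1} = r \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\right)
\]
```

The left side conditions only on the current state and action. The right side conditions on the full history. The property says the two distributions are equal, so the extra history adds no predictive information.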
Why it helps
By casting a problem as an MDP, we can reuse a rich set of algorithms that compute or learn good behavior. The same agent-environment loop applies across robotics, games, and recommender systems.
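The agent-environment loop mentioned above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: the `TRANSITIONS` and `REWARDS` tables, the `step` and `run_episode` functions, and both policies are hypothetical names and made-up numbers.

```python
import random

# Hypothetical toy environment; all probabilities and rewards are made up.
TRANSITIONS = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}
REWARDS = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}
ACTIONS = ("slow", "fast")


def step(state, action, rng):
    """Environment: sample the next state and return it with the reward."""
    dist = TRANSITIONS[(state, action)]
    next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, REWARDS[(state, action)]


def run_episode(policy, start="cool", horizon=10, gamma=0.9, seed=0):
    """Run the loop for `horizon` steps and return the discounted return."""
    rng = random.Random(seed)
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state, rng)               # agent acts on the current state only
        state, reward = step(state, action, rng)  # environment responds
        total += discount * reward
        discount *= gamma
    return total


def random_policy(state, rng):
    """Pick an action uniformly at random, ignoring the state."""
    return rng.choice(ACTIONS)


def cautious_policy(state, rng):
    """Always choose the safe action."""
    return "slow"


print("cautious:", run_episode(cautious_policy))
print("random:  ", run_episode(random_policy))
```

Swapping in a different problem means replacing only the tables and the policy. The loop in `run_episode` stays the same, which is the reuse the MDP framing provides.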
Key idea
An MDP frames a learning problem in terms of states, actions, transitions, rewards, and a discount factor, with the future depending only on the present state and action.