What it is
Direct preference optimization, or DPO, aligns a language model to human preferences without training a separate reward model or running reinforcement learning. It optimizes the policy directly on preference pairs with a simple binary classification-style loss.
The idea
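Classic RLHF trains a reward model on preference data, then uses an RL algorithm such as PPO to maximize that reward under a KL penalty that keeps the policy close to a reference model. DPO observes that for this KL-regularized objective, with preferences modeled by the Bradley-Terry model, the optimal policy has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy itself, so you can skip the explicit reward model and optimize the policy directly on the preference pairs.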
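The closed-form argument can be sketched in a few lines. The notation here is an assumption chosen for illustration and does not come from the text above: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference, $\beta$ the strength of the KL penalty, $Z(x)$ a normalizing constant, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its preferred and rejected responses drawn from a dataset $\mathcal{D}$.

```latex
% KL-regularized reward maximization (the RLHF objective)
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its optimum has a closed form in terms of the reward
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( r(x, y) / \beta \big)

% Inverting expresses the reward through the policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x)

% Bradley-Terry depends only on reward differences, so Z(x) cancels,
% leaving a loss in the policy alone
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

The last line is a logistic loss on the difference of two log-ratios, which is why the method reduces to a classification-style objective.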
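- The loss widens the gap between the log-likelihood of the preferred response and that of the rejected one, each measured relative to the reference model.
- It compares the policy against a frozen reference model to keep the update controlled, the same role the KL penalty plays in PPO-based RLHF. A coefficient, usually written β, sets how strongly the policy is held to the reference.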
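As a rough illustration of the two points above, the per-pair loss can be written in a few lines of plain Python. This is a minimal sketch, not a training implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical, and it assumes each input is the summed log-probability of a whole response under the policy or the frozen reference.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to frozen reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the margin is zero, so the loss is log 2.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy that favors the preferred response more than the reference does.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

When the policy matches the reference the loss is log 2 (about 0.693) for every pair, and it falls below that only as the policy prefers the chosen response more strongly than the reference does. In real training the log-probabilities would come from model forward passes and the loss would be averaged over a batch.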
Why it is attractive
- Simpler pipeline: one training stage instead of a reward model plus reinforcement learning.
- More stable: it is a supervised-style objective over a fixed dataset, so it avoids the sampling variance of policy-gradient methods.
- Cheaper to run, with fewer moving parts to tune.
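The trade-off is less flexibility than full reinforcement learning, since training uses a fixed set of preference pairs rather than fresh samples scored by a reward model, and quality still hinges on good preference data.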
Key idea
DPO aligns a model directly from preference pairs with a supervised-style loss against a frozen reference model, removing the separate reward model and RL loop.