Direct Preference Optimization

Aligning a model from preferences without a separate reward model.

What it is

Direct preference optimization, or DPO, aligns a language model to human preferences without training a separate reward model or running reinforcement learning. It optimizes the policy directly on preference pairs (a prompt with one preferred and one rejected response) using a simple classification-style loss.

The idea

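Classic RLHF trains a reward model, then uses PPO to maximize that reward under a KL penalty that keeps the policy close to a reference model. DPO observes that, for this KL-regularized setup, the optimal policy has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy itself, so you can skip the reward model and optimize the policy directly on preference pairs.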
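The closed-form step can be written out as a short derivation. This is a sketch under the standard assumptions (a KL-regularized reward objective and a Bradley-Terry preference model); the symbols $\pi_\theta$, $\pi_{\mathrm{ref}}$, $r$, $\beta$, $Z$, $y_w$ and $y_l$ are notation introduced here for illustration, not taken from the lesson text.

```latex
\begin{align}
% KL-regularized RLHF objective
\max_{\pi}\;& \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
% its optimal policy, with Z(x) a normalizing constant
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\
% inverted: the reward expressed through the policy
r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
% Bradley-Terry preference model; Z(x) cancels in the difference
p(y_w \succ y_l \mid x) &= \sigma\big( r(x, y_w) - r(x, y_l) \big) \\
% resulting loss on preference pairs (y_w preferred, y_l rejected)
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
  \log \sigma\!\Big(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Big) \right]
\end{align}
```

Because $\beta \log Z(x)$ appears in both rewards, it drops out of the difference, which is why the intractable normalizer never has to be computed.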

  • The loss raises the likelihood of the preferred response relative to that of the rejected one.
  • It compares the policy against a frozen reference model to keep the update controlled; this is the same role the KL penalty plays in PPO-based RLHF.
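The two points above can be sketched as a per-pair loss in plain Python. This is a minimal illustration, not a training implementation: the function names `softplus` and `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions made for the example, and each input is assumed to be the summed log-probability of a full response under the policy or the frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls how strongly the reference constrains the policy
) -> float:
    """DPO loss for a single preference pair (illustrative sketch)."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and the rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy already favors the preferred response: the loss is lower.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred response relative to the rejected one. In real training the same expression would be averaged over a batch and differentiated with an autodiff framework.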

Why it is attractive

  • Simpler pipeline: one training stage instead of a reward model plus reinforcement learning.
  • More stable: it is a supervised-style objective, which avoids the high variance of policy gradient methods.
  • Cheaper to run, with fewer moving parts to tune.

The trade-off is less flexibility than full reinforcement learning, and quality still hinges on good preference data.

Key idea

DPO aligns a model directly from preference pairs with a supervised-style loss measured against a frozen reference model, removing the separate reward model and RL loop.

Check yourself

1. What does DPO remove compared with classic RLHF?

2. Why does DPO compare the policy against a frozen reference model?