Machine Learning

Direct Preference Optimization (DPO)

How DPO aligns a model from preferences without a separate reward model or RL loop.

Skipping the RL loop

Direct preference optimization (DPO) pursues the same goal as RLHF, aligning a model with human preferences, but without training a separate reward model or running PPO. It optimizes the policy directly on preference pairs with a simple classification-style loss.

The key insight

  • The optimal policy for the KL-regularized RLHF objective has a closed-form relationship to the reward and the reference model.
  • DPO inverts this relationship: it reparameterizes the reward in terms of the policy itself, measured relative to the reference.
  • The result is a loss that pushes up the probability of chosen responses and pushes down rejected ones, relative to a frozen reference.
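
The reparameterization above can be sketched in equations. The notation is an assumption added here, not taken from the lesson: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} the frozen reference, r the reward, \beta the coefficient that limits drift from the reference, Z(x) a normalizer that depends only on the prompt, and (y_w, y_l) the chosen and rejected responses for prompt x.

```latex
% Closed-form optimum of the KL-regularized RLHF objective, then solved for the reward
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)
\quad\Longrightarrow\quad
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into a pairwise (Bradley-Terry) preference model cancels Z(x)
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Because the \beta \log Z(x) term is the same for both responses to a prompt, it drops out of the difference, which is why no explicit reward model is needed.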

What the loss does

  • It compares policy and reference log-probabilities for the chosen versus the rejected response, rewarding the policy for favoring the chosen response more strongly than the reference does.
  • A temperature-like coefficient (usually written β) controls how far the policy may move from the reference, playing a role similar to the KL penalty in RLHF.
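
A minimal sketch of the loss for a single preference pair, in plain Python. The function names `dpo_loss` and `softplus`, the argument names, and the default `beta=0.1` are illustrative assumptions, not part of the lesson; each input is assumed to be the summed log-probability of a whole response under the policy or the frozen reference.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    beta is the temperature-like coefficient (hypothetical default):
    larger values penalize drifting from the reference more strongly.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693). The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference, and rises above it when the policy shifts toward the rejected one. In practice this would be averaged over a batch and differentiated with an autodiff library.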

Why teams like it

  • No reward model to train, fewer moving parts, and typically more stable training than PPO-based RLHF.
  • Uses the same preference data RLHF would use.

Trade-offs

  • DPO can be more sensitive to the quality and coverage of preference pairs.
  • It lacks the online exploration of RL, so it only learns from the responses already in the dataset.

Key idea

DPO aligns a model directly from preference pairs by reparameterizing the reward in terms of the policy, giving RLHF-like results with no separate reward model or RL loop, at the cost of giving up online exploration.

Check yourself

1. What does DPO eliminate compared to classic RLHF?

2. What role does the reference model play in DPO?

3. What is a trade-off of DPO versus RLHF?