Direct Preference Optimization (DPO)

How DPO aligns a model from preferences without a separate reward model or RL loop.

Skipping the RL loop

Direct preference optimization (DPO) pursues the same goal as RLHF but without training a separate reward model or running PPO. It optimizes the policy directly on preference pairs with a simple classification-style loss.

The key insight

The optimal policy under the KL-regularized RLHF objective has a closed-form relationship to the reward and a reference model.
DPO inverts this: it reparameterizes the reward in terms of the policy itself.
The result is a loss that pushes up the probability of chosen responses and pushes down rejected ones, relative to a frozen reference.
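
The reparameterization above can be written out in a few lines. This is a sketch in the notation commonly used for DPO, none of which is defined in this text: x is a prompt, y_w and y_l are the chosen and rejected responses, r is the reward, \pi_ref is the frozen reference, \pi_\theta is the policy being trained, \beta is the KL coefficient, Z(x) is a normalizing constant, and \sigma is the logistic function. It also assumes preferences follow a Bradley-Terry model.

```latex
\begin{align*}
% Closed-form optimum of the KL-regularized RLHF objective
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left( \tfrac{1}{\beta}\, r(x, y) \right) \\[4pt]
% Inverted: the reward expressed through the policy
r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    + \beta \log Z(x) \\[4pt]
% Bradley-Terry preference probability; \beta \log Z(x) cancels in the difference
p(y_w \succ y_l \mid x) &= \sigma\big( r(x, y_w) - r(x, y_l) \big) \\[4pt]
% Resulting loss on the trainable policy \pi_\theta
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
\end{align*}
```

Because Z(x) depends only on the prompt, it drops out of the difference between the two rewards, which is why the loss never needs an explicit reward model.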

What the loss does

It compares the policy and reference log probabilities for chosen versus rejected responses.
A temperature-like coefficient, usually written β, controls how far the policy may move from the reference, playing a role similar to the KL penalty in RLHF.
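
To make the comparison concrete, here is a minimal sketch of the loss for a single preference pair in plain Python. The function names (`dpo_loss`, `neg_log_sigmoid`), the argument names, and the default `beta=0.1` are illustrative assumptions, not part of any library. Each log probability is assumed to be the summed token log probability of the full response given the prompt.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), which equals softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; tune per task
) -> float:
    """DPO loss for one (chosen, rejected) pair of response log probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy favors the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Binary classification loss on the margin.
    return neg_log_sigmoid(margin)


# A policy identical to the reference gives a zero margin.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

When the policy equals the reference, the margin is zero and the loss is log 2, the value for a coin-flip classifier. The reference terms are constants during training, so gradients flow only through the policy log probabilities. A real training loop would typically compute the same quantity on batched tensors and average it.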

Why teams like it

There is no reward model to train, there are fewer moving parts, and training is typically more stable than with PPO.
It uses the same preference data that RLHF would use.

Trade-offs

DPO can be more sensitive to the quality and coverage of preference pairs.
It lacks the online exploration of RL, so it only learns from the responses already in the dataset.

Key idea

DPO aligns a model directly from preference pairs by reparameterizing the reward in terms of the policy. It gives RLHF-like results with no separate reward model or RL loop, at the cost of having no online exploration.