The RLHF vs DPO Comparison

Aligning to preferences

After instruction tuning, models are often refined to match human preferences about which of two responses is better. Two leading methods are reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO).

RLHF in three stages

RLHF first trains a separate reward model on human preference comparisons. It then uses reinforcement learning (commonly PPO) to update the policy so that it maximizes that reward, with a KL-divergence penalty that keeps the policy close to the original model. The three stages are:

Collect preference pairs.
Train a reward model to score responses.
Optimize the policy with RL against the reward.
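
To make stages two and three concrete, here is a minimal sketch in plain Python. It assumes the reward model is trained with the common pairwise (Bradley-Terry) loss, and that the RL stage maximizes the reward minus a per-sample estimate of the KL penalty. The function names, the scalar inputs, and the beta value of 0.1 are illustrative assumptions, not part of any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Stage 2: pairwise loss for one preference pair.

    The loss is small when the reward model scores the preferred
    response higher than the rejected one.
    """
    return -log_sigmoid(score_chosen - score_rejected)


def kl_penalized_reward(reward: float, policy_logprob: float,
                        ref_logprob: float, beta: float = 0.1) -> float:
    """Stage 3: the quantity the RL step tries to maximize.

    The log-probability gap between the policy and the original
    (reference) model is a per-sample estimate of the KL penalty.
    beta = 0.1 is an assumed, illustrative penalty strength.
    """
    return reward - beta * (policy_logprob - ref_logprob)


if __name__ == "__main__":
    # A reward model that ranks the pair correctly incurs less loss.
    print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))
    # Drifting away from the reference model lowers the effective reward.
    print(kl_penalized_reward(1.0, policy_logprob=-4.0, ref_logprob=-5.0))
```

In a real pipeline these scalars would come from neural networks evaluated over batches of responses, and an RL algorithm would update the policy to increase the penalized reward.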

DPO in one stage

DPO skips the explicit reward model and the RL loop. Using the same preference pairs, it derives a loss that directly raises the likelihood of preferred responses and lowers that of rejected ones, each measured relative to a frozen reference model.
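
As a rough illustration, the DPO loss for a single preference pair can be sketched in plain Python as follows. The inputs are assumed to be summed log-probabilities of each full response under the trained policy and under a frozen reference model. The function name, argument names, and the beta value of 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is a response log-probability. beta (assumed 0.1 here)
    controls how strongly the policy is held near the reference model.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Because the loss depends only on log-probabilities from two models, it can be minimized with ordinary gradient descent, which is why no separate reward model or RL loop is needed.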

Trade-offs

DPO is typically simpler and more stable to train because it is plain supervised-style optimization, with no separate reward model to train and no RL hyperparameters to tune. RLHF is more complex, but its separate reward model can be reused and can capture richer signals, and RLHF remains common in large-scale alignment pipelines.

Key idea

RLHF trains a reward model and optimizes a policy with reinforcement learning, while DPO optimizes preferences directly with a single stable loss, trading flexibility for simplicity.
