Machine Learning

The RLHF vs DPO Comparison

Two ways to align a model to human preferences over outputs.

Aligning to preferences

After instruction tuning, models are often refined to match human preferences about which of two responses to the same prompt is better. Two leading methods are reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO).

RLHF in three stages

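RLHF trains a separate reward model on human preference comparisons, then uses reinforcement learning to update the policy to maximize that reward. A penalty, typically a KL-divergence term, keeps the policy close to the original model so it does not drift toward outputs that merely exploit the reward model.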

  • Collect preference pairs.
  • Train a reward model to score responses.
  • Optimize the policy with RL against the reward.
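
The last two stages can be sketched with plain numbers. This is a minimal illustration, not a training loop: it assumes the reward model is trained with a Bradley-Terry style pairwise loss and that the penalty is a per-sample log-ratio against the original model scaled by a coefficient `beta`. The function names and the default `beta=0.1` are hypothetical choices made for this example.

```python
import math


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss for stage 2: -log(sigmoid(r_chosen - r_rejected)).

    Assumes a Bradley-Terry style objective; lower is better, and the loss
    shrinks as the chosen response is scored further above the rejected one.
    """
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def penalized_reward(reward: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Signal maximized in stage 3: reward minus a closeness penalty.

    logp_policy and logp_ref are the log-probabilities of the same response
    under the current policy and the original (reference) model.
    """
    return reward - beta * (logp_policy - logp_ref)
```

When the policy assigns a response the same log-probability as the original model, the penalty is zero and the policy sees the raw reward. As the policy makes a response more likely than the original model did, the penalty grows in proportion to `beta`.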

DPO in one stage

DPO skips the explicit reward model and the RL loop. Using the same preference pairs, it derives a loss that directly raises the probability of preferred responses and lowers that of rejected ones, with both measured relative to a frozen copy of the starting model.
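
A minimal sketch of the DPO loss for a single preference pair, assuming the sequence log-probabilities have already been computed under the policy being trained and under a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not part of any particular library.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each gain is how much more likely the policy makes a response than the
    frozen reference model does, measured in log-probability.
    """
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    z = beta * (chosen_gain - rejected_gain)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

At the start of training the policy equals the reference model, so both gains are zero and the loss is log 2. The loss falls as the policy raises the preferred response's probability, lowers the rejected one's, or both. There is no sampling step and no reward model in the loop, which is what makes training look like ordinary supervised optimization.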

Trade-offs

DPO is simpler and usually more stable to train because it is plain supervised-style optimization, with no reward model to fit and no RL hyperparameters to tune. RLHF is more complex, but its separate reward model can be reused to score new responses and can capture richer signals, and it remains common in large-scale alignment pipelines.

Key idea

RLHF trains a reward model and optimizes a policy with reinforcement learning, while DPO optimizes preferences directly with a single stable loss, trading flexibility for simplicity.

Check yourself

1. What extra component does RLHF require that DPO does not?

2. What is a common advantage of DPO over RLHF?