Aligning to preferences
After instruction tuning, models are often refined to match human preferences, expressed as judgments about which of two responses to the same prompt is better. Two leading methods are reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO).
RLHF in three stages
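RLHF trains a separate reward model on human preference comparisons, then uses reinforcement learning to update the policy to maximize that reward. A penalty, typically a KL-divergence term, keeps the policy close to the original model so it cannot drift toward outputs that merely exploit the reward model.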
- Collect preference pairs.
- Train a reward model to score responses.
- Optimize the policy with RL against the reward.
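The last two stages can be sketched numerically. This is a minimal illustration, not a training loop. It assumes a pairwise Bradley-Terry style loss for the reward model and a per-sample log-probability ratio as the penalty term, both common choices that the text above does not specify. The function names, the scalar inputs, and the `beta` default are hypothetical.

```python
import math


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Small when the reward model scores the preferred response higher.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # assumed penalty strength, chosen for illustration
) -> float:
    """Reward the RL step maximizes: reward minus a closeness penalty.

    The penalty is beta times the log-probability ratio between the
    current policy and the original (reference) model for this sample.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # Reward model agrees with the human label: low loss.
    print(reward_model_loss(2.0, 0.0))
    # Reward model disagrees with the human label: high loss.
    print(reward_model_loss(0.0, 2.0))
    # Policy has moved away from the reference, so the reward is reduced.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

In a real pipeline these quantities would be computed over batches of token sequences by neural networks; the scalars here only show the shape of each objective.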
DPO in one stage
DPO skips the explicit reward model and RL loop. Using the same preference pairs, it derives a loss that directly raises the probability of preferred responses and lowers that of rejected ones, measured relative to a frozen reference model.
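As a rough sketch of that loss for a single preference pair, following the standard DPO formulation: the inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model. The function name, argument names, and the `beta` default are illustrative assumptions, and a real implementation would operate on batched tensors rather than floats.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature, chosen for illustration
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each "gain" is how much more likely the policy makes a response
    than the reference model does, in log space.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_gain - rejected_gain)
    # Numerically stable form of log(1 + exp(-logits)).
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy has raised the chosen response and lowered the rejected one.
    print(dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

Minimizing this value pushes the policy to favor the preferred response more strongly than the reference does, which is why no separate reward model or sampling loop is needed.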
Trade-offs
DPO is simpler and typically more stable to train because it is plain supervised-style optimization, with no reward model to fit and no RL hyperparameters to tune. RLHF is more complex, but its separate reward model can be reused and can capture richer signals, and it remains common in large-scale alignment pipelines.
Key idea
RLHF trains a reward model and then optimizes a policy against it with reinforcement learning, while DPO optimizes on the preference pairs directly with a single, stable loss, trading flexibility for simplicity.