The three stages
Reinforcement learning from human feedback (RLHF) is the classic recipe that combines the earlier pieces:
- Start from a supervised fine-tuned (SFT) policy.
- Train a reward model from human preference comparisons.
- Optimize the policy with reinforcement learning to maximize reward.
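The second stage can be made concrete with a small sketch. The notes do not name a loss for the reward model, so this assumes the commonly used Bradley-Terry pairwise objective, where the model is trained so that the human-preferred response scores higher than the rejected one. The function name and the scalar inputs are illustrative; a real reward model would produce these scores from a neural network.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes a Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    so the loss is -log(sigmoid(margin)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model agrees with the human label
# and large when it ranks the pair the wrong way round.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Only the difference between the two scores enters the loss, so under this assumption the reward model's outputs are meaningful up to an additive constant.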
The RL step
- The policy generates responses, the reward model scores them, and an algorithm such as Proximal Policy Optimization (PPO) updates the policy toward higher reward.
- A KL penalty keeps the policy close to the SFT model so it does not drift into degenerate text that games the reward.
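A minimal sketch of how the KL penalty can enter the reward that the RL algorithm sees. It assumes one common formulation, in which the reward model score for a response is reduced by a coefficient `beta` times the summed per-token log-probability ratio between the policy and the SFT reference. The function name, the `beta` default, and the example log-probabilities are all hypothetical.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward model score minus a KL penalty for one sampled response.

    The summed log ratio log pi(y|x) - log pi_ref(y|x) over the sampled tokens
    is a single-sample estimate of KL(pi || pi_ref) for this prompt.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio


# Same reward model score, but the second policy assigns the sampled tokens
# much higher probability than the reference does, so it is penalized.
unchanged = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_shaped_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])
```

Implementations differ on whether the penalty is applied per token or once per sequence; this sketch uses the per-sequence form for brevity.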
Why the KL term matters
- Without it, the policy can collapse onto a few high-scoring but odd outputs.
- The penalty balances reward gain against staying near a sane reference distribution.
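The trade-off can be illustrated on a toy problem. For a fixed prompt and a finite set of outputs, maximizing expected reward minus `beta` times KL(policy || reference) has a known closed-form solution: the optimal policy is proportional to the reference probability times exp(reward / beta). The sketch below evaluates that formula for a strong and a weak penalty. The four-output distribution and the reward values are made up for illustration.

```python
import math
from typing import List, Sequence


def kl_regularized_optimum(
    ref_probs: Sequence[float], rewards: Sequence[float], beta: float
) -> List[float]:
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite output set.

    Closed form: pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta).
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy setup: the last output is rare under the reference
# but receives the highest score from the (proxy) reward model.
ref_probs = [0.4, 0.3, 0.2, 0.1]
rewards = [0.0, 0.5, 1.0, 3.0]

strong_penalty = kl_regularized_optimum(ref_probs, rewards, beta=10.0)
weak_penalty = kl_regularized_optimum(ref_probs, rewards, beta=0.1)
```

With the strong penalty the result stays close to the reference distribution, while with the weak penalty nearly all probability mass moves to the single highest-scoring output, which is the collapse described above.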
Practical difficulties
- RLHF is unstable and compute-heavy, with many sensitive hyperparameters.
- The reward model is only a proxy for human preferences, so over-optimizing it causes reward hacking and quality regressions.
- Human feedback is expensive and can encode labeler biases.
Key idea
RLHF optimizes an SFT policy against a learned reward model using PPO with a KL penalty, balancing reward maximization against staying close to a sane reference to avoid reward hacking.