Why we need it
A model trained only to predict the next token learns what people write, not what people prefer as an answer. Reinforcement learning from human feedback, or RLHF, tunes a pretrained model so its responses are more helpful, honest, and harmless.
The three stages
RLHF typically proceeds in three stages.
- Supervised fine tuning trains the base model on high quality example responses to follow instructions
- Reward modeling collects human comparisons of pairs of responses and trains a reward model to predict which one people prefer
- Policy optimization uses reinforcement learning to update the language model so it earns higher reward, often with the algorithm PPO
Keeping it grounded
During policy optimization, a penalty keeps the tuned model close to the original so it does not drift into degenerate text that fools the reward model. This penalty is a divergence term measured against the starting policy.
Reward hacking and limits
The reward model is only a proxy for human judgment. If pushed too hard, the policy can find ways to score high reward without truly being better, a failure called reward hacking. Careful data, the divergence penalty, and ongoing evaluation help contain it.
Simpler alternatives
Newer methods such as direct preference optimization skip the separate reward model and train directly on preference pairs, achieving similar alignment with a simpler pipeline.
Key idea
RLHF aligns a model by training a reward model from human preferences and optimizing the policy toward it while staying close to the base.