RLHF Basics

Why we need it

A model trained only to predict the next token learns what people write, not what people prefer as an answer. Reinforcement learning from human feedback, or RLHF, tunes a pretrained model so its responses are more helpful, honest, and harmless.

The three stages

RLHF typically proceeds in three stages.

Supervised fine tuning trains the base model on high quality example responses to follow instructions
Reward modeling collects human comparisons of pairs of responses and trains a reward model to predict which one people prefer
Policy optimization uses reinforcement learning to update the language model so it earns higher reward, often with the algorithm PPO

Keeping it grounded

During policy optimization, a penalty keeps the tuned model close to the original so it does not drift into degenerate text that fools the reward model. This penalty is a divergence term measured against the starting policy.

Reward hacking and limits

The reward model is only a proxy for human judgment. If pushed too hard, the policy can find ways to score high reward without truly being better, a failure called reward hacking. Careful data, the divergence penalty, and ongoing evaluation help contain it.

Simpler alternatives

Newer methods such as direct preference optimization skip the separate reward model and train directly on preference pairs, achieving similar alignment with a simpler pipeline.

Key idea