Machine Learning

RLHF Basics

Aligning a language model to human preferences with a learned reward.

6 min read · advanced

Why we need it

A model trained only to predict the next token learns what people tend to write, not which answer people would prefer. Reinforcement learning from human feedback (RLHF) fine-tunes a pretrained model so that its responses are more helpful, honest, and harmless.

The three stages

RLHF typically proceeds in three stages.

  • Supervised fine-tuning trains the base model on high-quality example responses so that it learns to follow instructions
  • Reward modeling collects human comparisons of pairs of responses and trains a reward model to predict which one people prefer
  • Policy optimization uses reinforcement learning to update the language model so it earns higher reward, often with proximal policy optimization (PPO)
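
The reward modeling stage can be made concrete with a small sketch. It assumes the common pairwise (Bradley-Terry style) objective, in which the reward model is pushed to score the preferred response above the rejected one; the lesson does not name a specific loss, and the function names here are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison (assumed Bradley-Terry form).

    Equals -log(sigmoid(reward_chosen - reward_rejected)): small when the
    reward model scores the human-preferred response higher, large otherwise.
    """
    margin = reward_chosen - reward_rejected
    return softplus(-margin)


if __name__ == "__main__":
    # Reward model agrees with the human label: low loss.
    print(preference_loss(reward_chosen=2.0, reward_rejected=0.0))
    # Reward model disagrees with the human label: high loss.
    print(preference_loss(reward_chosen=0.0, reward_rejected=2.0))
```

In a real pipeline the two rewards would come from a neural network scoring each response, and the loss would be averaged over many comparisons.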

Keeping it grounded

During policy optimization, a penalty keeps the tuned model close to the original so that it does not drift into degenerate text that fools the reward model. This penalty is a divergence term, typically the KL divergence, between the tuned policy and the starting policy.
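
One way to picture the penalty is as a correction subtracted from the reward model's score. The sketch below assumes a per-response KL estimate built from token log-probabilities and a penalty weight called beta; both the function name and the default value of beta are illustrative assumptions, not details from the lesson.

```python
from typing import Sequence


def kl_shaped_reward(
    reward: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty weight
) -> float:
    """Reward model score minus a divergence penalty for one sampled response.

    The KL term is estimated from the sampled tokens as the sum of
    log pi_policy(token) - log pi_reference(token).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # Policy identical to the reference on these tokens: no penalty applied.
    print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
    # Policy has drifted toward these tokens: the same raw reward is worth less.
    print(kl_shaped_reward(1.0, [-0.5, -1.0], [-1.0, -2.0]))
```

A larger beta holds the policy closer to the starting model, at the cost of slower gains in reward.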

Reward hacking and limits

The reward model is only a proxy for human judgment. If pushed too hard, the policy can find ways to score high reward without truly being better, a failure called reward hacking. Careful data, the divergence penalty, and ongoing evaluation help contain it.

Simpler alternatives

Newer methods such as direct preference optimization (DPO) skip the separate reward model and train the policy directly on preference pairs, often reaching comparable alignment with a simpler pipeline.
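
To show what training directly on preference pairs can look like, here is a minimal sketch of the commonly cited DPO loss for a single pair. It assumes the inputs are summed log-probabilities of each response under the policy and under a frozen reference model; the argument names and the default beta are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical strength of the pull toward the reference
) -> float:
    """DPO loss for one preference pair, with no separate reward model.

    Each response's implicit reward is how much the policy has raised its
    log-probability relative to the reference model.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * (chosen_shift - rejected_shift)))
    return softplus(-beta * (chosen_shift - rejected_shift))


if __name__ == "__main__":
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
    # Policy favors the rejected response instead: higher loss.
    print(dpo_loss(-6.0, -4.0, -5.0, -5.0))
```

Because the reference log-probabilities appear inside the loss, the objective itself keeps the policy near the starting model, which is the role the explicit divergence penalty plays in the three-stage pipeline.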

Key idea

RLHF aligns a model by training a reward model from human preferences and optimizing the policy toward it while staying close to the base.

Check yourself

Answer to earn rating on the learn ladder.

1. What does the reward model in RLHF predict?

2. Why include a penalty that keeps the policy close to the base model?

3. What does direct preference optimization remove from the pipeline?