Machine Learning

The RLHF Pipeline

How reinforcement learning from human feedback ties the alignment stages together.

The three stages

Reinforcement learning from human feedback (RLHF) is the classic recipe that combines the earlier pieces:

  • Start from a supervised fine-tuned (SFT) policy.
  • Train a reward model from human preference comparisons.
  • Optimize the policy with reinforcement learning to maximize the reward model's score.
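
The second stage can be made concrete with a small sketch. A common way to train a reward model on preference comparisons is a Bradley-Terry style pairwise loss: the model should score the response the labeler preferred above the one they rejected. The function name `preference_loss` and the scalar inputs are assumptions for illustration; a real reward model would produce these scores from a neural network and average the loss over a batch.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the chosen response is scored well above the rejected one,
    large when the ordering is wrong.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the reward model agrees more strongly with the labeler.
agree = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagree = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees with labeler: {agree:.4f}, disagrees: {disagree:.4f}")
```

Only the difference between the two scores enters the loss, so the reward model learns a relative ranking rather than an absolute scale.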

The RL step

  • The policy generates responses, the reward model scores them, and an algorithm like PPO updates the policy toward higher reward.
  • A KL divergence penalty keeps the policy close to the SFT model, so it does not drift into degenerate text that games the reward model.
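
One common way to apply the penalty is to subtract it from the reward before the RL update. The sketch below assumes per-token log-probabilities from the policy and from the frozen SFT reference are already available; the function name `kl_shaped_reward` and the coefficient `beta` are illustrative names, not part of any specific library, and the PPO update itself is not shown.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty for one sampled response.

    policy_logprobs / ref_logprobs: log-probability of each generated token
    under the current policy and under the frozen SFT reference.
    beta: assumed penalty coefficient; larger values hold the policy closer
    to the reference.
    """
    # Sample-based KL estimate: sum of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# A response the policy now finds much more likely than the reference did
# has part of its reward taken away.
shaped = kl_shaped_reward(
    reward=1.0,
    policy_logprobs=[-0.5, -0.5],
    ref_logprobs=[-1.5, -2.5],
    beta=0.1,
)
print(f"shaped reward: {shaped:.3f}")
```

When the policy and the reference assign the same log-probabilities, the penalty is zero and the shaped reward equals the raw reward-model score.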

Why the KL term matters

  • Without it, the policy can collapse onto a few high-scoring but odd outputs.
  • The penalty balances reward gain against staying near a sane reference distribution.
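
A toy example can make the collapse visible. For a finite set of candidate outputs, the policy that maximizes expected reward minus `beta` times the KL divergence from the reference has a closed form: each output's probability is proportional to its reference probability times `exp(reward / beta)`. The three-output distribution, the reward values, and the function name `kl_regularized_optimum` below are made up for illustration.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Policy maximizing E[reward] - beta * KL(policy || reference).

    Closed form: pi(y) is proportional to pi_ref(y) * exp(reward(y) / beta).
    Assumes every reference probability is strictly positive.
    """
    log_weights = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(log_weights)  # subtract the max before exponentiating
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical setup: the third output is rare under the reference but the
# reward model scores it far higher than the others.
ref_probs = [0.70, 0.25, 0.05]
rewards = [1.0, 1.2, 3.0]

for beta in (10.0, 1.0, 0.1):
    policy = kl_regularized_optimum(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in policy])
```

With a large `beta` the result stays close to the reference distribution; as `beta` shrinks, nearly all probability moves onto the single highest-reward output, which is the collapse described above.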

Practical difficulties

  • RLHF is unstable and compute-heavy, with many sensitive hyperparameters.
  • The reward model is only a proxy for human preferences, so over-optimizing it causes reward hacking and quality regressions.
  • Human feedback is expensive and can encode labeler biases.

Key idea

RLHF optimizes an SFT policy against a learned reward model using PPO with a KL penalty, balancing reward maximization against staying close to a sane reference to avoid reward hacking.

Check yourself

1. What is the role of the KL penalty in RLHF?

2. Which algorithm is classically used for the RL optimization step?

3. Why does over-optimizing the reward model hurt quality?