The RLHF Pipeline

How reinforcement learning from human feedback ties the alignment stages together.

The three stages

Reinforcement learning from human feedback (RLHF) is the classic recipe that combines the earlier pieces:

Start from a supervised fine-tuned (SFT) policy.
Train a reward model from human preference comparisons.
Optimize the policy with reinforcement learning to maximize reward.
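
The second stage can be made concrete with a small sketch. It assumes the reward model is trained with a Bradley-Terry style pairwise loss, a common choice that the text above does not specify. The function name preference_loss and the scalar inputs are hypothetical; a real reward model would produce these scores from a neural network and average the loss over a batch.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison (assumed Bradley-Terry model).

    Models P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected)
    and returns the negative log of that probability.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the reward model agrees with the labeler,
# and large when it ranks the rejected response higher.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Minimizing this loss pushes the score of the preferred response above the score of the rejected one, which is all the RL stage needs: a scalar that ranks responses the way labelers did.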

The RL step

The policy generates responses, the reward model scores them, and an algorithm such as proximal policy optimization (PPO) updates the policy toward higher reward.
A Kullback-Leibler (KL) divergence penalty keeps the policy close to the SFT model so it does not drift into degenerate text that games the reward.
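
One common way to apply the penalty is to subtract it from the reward before the policy update. The sketch below assumes a per-response penalty estimated from the log-probabilities of the sampled tokens under the policy and under the frozen SFT model. The function name kl_penalized_reward, the coefficient beta, and its default value are illustrative assumptions, not something the text prescribes.

```python
from typing import Sequence


def kl_penalized_reward(
    reward: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward model score minus a KL penalty against the SFT reference.

    The sum of per-token log-ratios over the sampled response is a
    single-sample estimate of KL(policy || reference) for that response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# A response the policy now finds much more likely than the SFT model did
# has part of its reward taken away.
shaped = kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0], beta=0.1)
```

When the policy and the reference assign the same log-probabilities, the penalty is zero and the shaped reward equals the raw reward model score.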

Why the KL term matters

Without it, the policy can collapse onto a few high-scoring but odd outputs.
The penalty balances reward gain against staying near a sane reference distribution.
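
The trade-off can be seen in a toy setting with a handful of candidate outputs. For the objective "expected reward minus beta times KL to the reference", the optimal policy has a known closed form: each output's reference probability is reweighted by exp(reward / beta). The sketch below uses that result. The four probabilities and rewards are made-up illustrative numbers, and kl_regularized_optimum is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def kl_regularized_optimum(
    ref_probs: Sequence[float], rewards: Sequence[float], beta: float
) -> List[float]:
    """Optimal policy for: maximize E[reward] - beta * KL(policy || reference).

    Closed form: policy(y) is proportional to ref(y) * exp(reward(y) / beta).
    Assumes every reference probability is strictly positive.
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


ref_probs = [0.4, 0.3, 0.2, 0.1]  # toy reference (SFT) distribution
rewards = [0.0, 0.5, 1.0, 3.0]    # toy reward scores; the rare output scores highest

weak_penalty = kl_regularized_optimum(ref_probs, rewards, beta=0.05)
strong_penalty = kl_regularized_optimum(ref_probs, rewards, beta=5.0)
```

With the weak penalty, nearly all probability mass lands on the single highest-scoring output, which is the collapse described above. With the strong penalty, the policy stays close to the reference and shifts only modestly toward higher reward.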

Practical difficulties

RLHF training is often unstable and compute-heavy, with many sensitive hyperparameters.
The reward model is only a proxy for human judgment, so over-optimizing it causes reward hacking and quality regressions.
Human feedback is expensive and can encode labeler biases.

Key idea

RLHF optimizes an SFT policy against a learned reward model using PPO with a KL penalty, balancing reward maximization against staying close to a sane reference to avoid reward hacking.
