The Reward Model in RLHF

What it is

In reinforcement learning from human feedback (RLHF), a reward model is a network that assigns each response a scalar score estimating how well it matches human preference. It stands in for a human rater, so the policy can be trained against millions of generations that no team of raters could score by hand.

How it is trained

The reward model learns from comparison data.

Humans see two responses to the same prompt and pick the better one.
The reward model is trained so the preferred response gets a higher score than the rejected one.
The loss is a ranking loss over these pairs, not an absolute target.

This sidesteps the difficulty of asking humans for precise numeric scores, since relative judgments are easier and more consistent.
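
The ranking loss described above can be sketched in a few lines. The text does not name a specific loss, so this example assumes the commonly used Bradley-Terry style form, -log(sigmoid(score_chosen - score_rejected)); the function names are illustrative, and the scores are plain floats standing in for reward model outputs.

```python
import math

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def batch_loss(pairs):
    """Mean loss over (score_chosen, score_rejected) pairs."""
    return sum(pairwise_ranking_loss(c, r) for c, r in pairs) / len(pairs)

# The loss shrinks as the preferred response is scored further above the other.
well_ranked = pairwise_ranking_loss(2.0, 0.0)
mis_ranked = pairwise_ranking_loss(0.0, 2.0)
```

Only the difference between the two scores enters the loss, which is why the model needs no absolute target: adding a constant to every score leaves the loss unchanged.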

How it drives training

Once trained, the reward model scores fresh responses sampled from the policy. A reinforcement learning algorithm such as PPO updates the policy to raise reward, while a penalty, commonly on the KL divergence from the base model, keeps the policy close to that model so it does not drift.
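
One way to picture the penalty is as a term subtracted from the reward model's score before the RL update. The sketch below assumes a KL-style penalty estimated from per-token log-probabilities of the sampled response; the function name, the coefficient `beta`, and its default value are hypothetical choices for illustration, not values given in the text.

```python
def penalized_reward(rm_score, policy_logprobs, base_logprobs, beta=0.1):
    """Reward model score minus a penalty for drifting from the base model.

    policy_logprobs / base_logprobs: log-probabilities each model assigns
    to the tokens of the same sampled response (assumed inputs).
    beta: hypothetical penalty strength.
    """
    if len(policy_logprobs) != len(base_logprobs):
        raise ValueError("log-probability lists must cover the same tokens")
    # Sum of per-token log-ratios: a sample-based estimate of the KL term.
    kl_estimate = sum(p - b for p, b in zip(policy_logprobs, base_logprobs))
    return rm_score - beta * kl_estimate

# If the policy matches the base model on every token, the penalty is zero.
unchanged = penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# If the policy is much more confident than the base model, reward is reduced.
drifted = penalized_reward(1.0, [-0.5, -0.5], [-1.5, -1.5], beta=0.1)
```

A larger `beta` holds the policy closer to the base model at the cost of slower reward gains; a smaller one allows more movement and more room for drift.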

A real risk is reward hacking: the policy finds outputs that score high yet are not truly better, like padding or flattery, so the reward model must be monitored and refreshed.
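
As one illustration of monitoring, a simple check for padding is to see whether reward rises with response length across a batch of samples. This is a heuristic sketch, not a standard diagnostic: the function names are hypothetical, length is approximated by whitespace-separated word count, and any alert threshold would need to be chosen empirically.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable relationship
    return cov / math.sqrt(var_x * var_y)

def length_reward_correlation(responses, scores):
    """Correlation between word count and reward score (heuristic)."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)

# Toy batch in which longer responses always score higher.
corr = length_reward_correlation(["a", "a b", "a b c"], [1.0, 2.0, 3.0])
```

A correlation that climbs over the course of training, without a matching rise in human-judged quality, is a hint that the policy may be exploiting length rather than improving.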

Key idea

The RLHF reward model learns from human preference comparisons to score responses and guide policy training; because the policy can exploit its flaws, it must be monitored for reward hacking.
