Reward Model Training

The need for a reward signal

To improve beyond imitation, alignment needs a way to score arbitrary responses by quality. A reward model learns this from human comparisons.

Collecting preferences

For a given prompt, the model generates two or more candidate responses.
Human labelers pick which response is better in terms of helpfulness and safety.
This yields pairs of chosen and rejected responses.
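
As a rough illustration of the data this step produces, the sketch below stores one labeled comparison as a record. The names PreferencePair and to_pair, and the example prompt and responses, are hypothetical and chosen only for this example; a real pipeline would likely also track labeler IDs and handle more than two candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """One labeled comparison: the prompt plus the preferred and dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def to_pair(prompt: str, candidates: List[str], preferred_index: int) -> PreferencePair:
    """Turn a labeler's pick between two candidates into a (chosen, rejected) pair."""
    if len(candidates) != 2:
        raise ValueError("this sketch handles exactly two candidates")
    if preferred_index not in (0, 1):
        raise ValueError("preferred_index must be 0 or 1")
    return PreferencePair(
        prompt=prompt,
        chosen=candidates[preferred_index],
        rejected=candidates[1 - preferred_index],
    )


# Hypothetical example: the labeler preferred the first candidate.
pair = to_pair(
    "Explain what a hash map is.",
    ["A hash map stores key-value pairs for fast lookup.", "It is a map."],
    preferred_index=0,
)
print(pair.chosen)
print(pair.rejected)
```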

The training objective

The reward model takes a prompt and a response and outputs a single scalar.
It is trained so that the chosen response scores higher than the rejected one, using a Bradley-Terry-style logistic loss on the score difference.
Absolute scores are not calibrated; only the relative ordering matters.
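
A minimal sketch of the loss on a single pair follows, assuming the common form -log(sigmoid(score_chosen - score_rejected)). The function name pairwise_loss and the example scores are illustrative. In practice the scores would come from a neural network and the loss would be averaged over a batch.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Logistic loss on the score difference: -log(sigmoid(chosen - rejected))."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); this form avoids overflow for large |d|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Correct ordering gives a small loss; reversed ordering gives a large one.
loss_correct = pairwise_loss(2.0, 0.0)
loss_reversed = pairwise_loss(0.0, 2.0)

# Shifting both scores by the same constant leaves the loss unchanged,
# which is why absolute scores carry no meaning on their own.
loss_base = pairwise_loss(2.0, 0.5)
loss_shifted = pairwise_loss(102.0, 100.5)

print(loss_correct, loss_reversed)
print(loss_base, loss_shifted)
```

When the two scores are equal, the loss is log(2), which corresponds to the model assigning each response a 50% chance of being preferred.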

Why comparisons, not ratings

People often disagree on numeric scores but agree more readily on which of two answers is better.
Pairwise comparisons are easier to collect consistently and reduce labeler noise.

Failure modes

A reward model can be gamed: the policy may find responses that score highly under the reward model but are actually poor, which is an early sign of reward hacking.
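
The toy sketch below shows how a flawed score can be exploited. Everything in it is hypothetical: proxy_reward is a deliberately bad stand-in for a learned reward model (it simply counts words), and best_of_n is a crude stand-in for policy optimization that picks whichever candidate scores highest.

```python
from typing import List


def proxy_reward(response: str) -> float:
    """Hypothetical flawed reward: longer responses score higher, regardless of content."""
    return float(len(response.split()))


def best_of_n(candidates: List[str]) -> str:
    """Pick the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)


candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "Great question! " * 5 + "I am not sure.",  # padded and unhelpful
]

picked = best_of_n(candidates)
print(picked)
```

Here the selection favors the padded, unhelpful answer because the proxy rewards length rather than quality. A learned reward model can have subtler blind spots of the same kind.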

Key idea

A reward model learns a scalar quality score from pairwise human preferences using a logistic loss on chosen versus rejected responses, capturing relative ordering rather than absolute ratings.