The need for a reward signal
To improve beyond imitation, alignment needs a way to score arbitrary responses by quality. A reward model learns this from human comparisons.
Collecting preferences
- For a given prompt, the model generates two or more candidate responses.
- Human labelers pick which response is better, judged on criteria such as helpfulness and safety.
- This yields pairs of chosen and rejected responses.
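As a rough illustration of what one collected comparison might look like in code, here is a minimal sketch. The `PreferencePair` record, the `make_pair` helper, and the "a"/"b" label convention are all hypothetical names chosen for this example, not part of any particular labeling tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One labeled comparison: the labeler preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def make_pair(prompt: str, response_a: str, response_b: str, preferred: str) -> PreferencePair:
    """Turn a labeler's choice ("a" or "b", an assumed convention) into a record."""
    if preferred == "a":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if preferred == "b":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError(f"preferred must be 'a' or 'b', got {preferred!r}")


# Example: two candidate responses to the same prompt, with the labeler preferring A.
pair = make_pair(
    prompt="How do I reverse a list in Python?",
    response_a="Use my_list[::-1] or my_list.reverse().",
    response_b="Lists cannot be reversed.",
    preferred="a",
)
print(pair.chosen)
```

When more than two candidates are generated, one plausible approach is to emit one such record per compared pair.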
The training objective
- The reward model takes a prompt and a response and outputs a single scalar.
- It is trained so that the chosen response scores higher than the rejected one, using a Bradley-Terry-style logistic loss on the score difference.
- Absolute scores are not calibrated; because the loss depends only on the difference between two scores, only the relative ordering matters.
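The objective above can be sketched for a single pair as the negative log-sigmoid of the score difference, that is, loss = -log(sigmoid(r_chosen - r_rejected)). The function name and the plain-float interface are assumptions made for illustration; a real reward model would compute the scores with a neural network and average this loss over a batch.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins: -log(sigmoid(diff))."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Correct ordering gives a small loss; reversed ordering gives a large one.
loss_correct = bradley_terry_loss(2.0, 0.5)
loss_reversed = bradley_terry_loss(0.5, 2.0)

# Shifting both scores by the same constant leaves the loss unchanged,
# which is why absolute scores carry no meaning on their own.
loss_shifted = bradley_terry_loss(102.0, 100.5)

print(loss_correct < loss_reversed)
print(math.isclose(loss_correct, loss_shifted))
```

Minimizing this loss pushes the chosen score above the rejected score, but it places no constraint on where either score sits on the number line.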
Why comparisons, not ratings
- Labelers often assign very different numeric scores to the same response, but agree more often on which of two answers is better.
- Pairwise comparisons are easier to collect consistently and reduce labeler noise.
Failure modes
- A reward model can be gamed: the policy optimized against it may find responses that score highly despite being poor, an early sign of reward hacking.
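A toy sketch of this failure mode is shown here. The `proxy_reward` function is a deliberately flawed, hypothetical stand-in for a learned reward model; the length bias it encodes is an assumption made for illustration. Selecting the highest-scoring candidate then plays the role of the policy exploiting the flaw.

```python
def proxy_reward(response: str) -> float:
    """Hypothetical flawed reward: scores a response by its word count alone."""
    return float(len(response.split()))


# Candidate answers to "What is the capital of France?"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "Great question! There are many things to consider, and capitals are "
    "fascinating, so let us explore the topic at length first.",
]

# Picking the top-scoring candidate stands in for optimizing against the reward.
best = max(candidates, key=proxy_reward)

# The winner is the padded response that never answers the question.
print(best)
print("Paris" in best)
```

The highest-scoring response here is the least useful one, which is the gap between the proxy score and actual quality that reward hacking exploits.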
Key idea
A reward model learns a scalar quality score from pairwise human preferences using a logistic loss on chosen versus rejected responses, capturing relative ordering rather than absolute ratings.