Machine Learning

Reward Model Training

How human preference comparisons become a scalar score for responses.

5 min read · intro

The need for a reward signal

To improve a model beyond imitating demonstrations, alignment training needs a way to score arbitrary responses by quality. A reward model learns this scoring function from human comparisons.

Collecting preferences

  • For a given prompt, the model generates two or more candidate responses.
  • Human labelers pick which response is better, judging on criteria such as helpfulness and safety.
  • This yields pairs of chosen and rejected responses.
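
As a rough illustration of the resulting data, the sketch below stores each comparison as a (prompt, chosen, rejected) record. The names PreferencePair and pairs_from_ranking are hypothetical, and expanding a full best-to-worst ranking into every implied pair is an assumption for the case of more than two candidates, not something the lesson prescribes.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One labeled comparison: the labeler preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def pairs_from_ranking(prompt, ranked_responses):
    """Expand a best-to-worst ranking into all (better, worse) pairs.

    Assumption: the labeler supplied a full ordering of the candidates.
    With exactly two candidates this yields a single pair.
    """
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = pairs_from_ranking(
    "Explain recursion.",
    ["answer A", "answer B", "answer C"],  # ordered best to worst
)
for pair in pairs:
    print(pair.chosen, ">", pair.rejected)
```

Three ranked candidates give three pairs here; in general k ranked candidates give k(k-1)/2 pairs.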

The training objective

  • The reward model takes a prompt and a response and outputs a single scalar.
  • It is trained so that the chosen response scores higher than the rejected one, using a Bradley-Terry-style logistic loss on the score difference.
  • Absolute scores are not calibrated; only the relative ordering matters, because the loss depends only on the difference between two scores.
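
A minimal sketch of the objective, assuming the common Bradley-Terry form in which the probability that the chosen response wins is sigmoid(r_chosen - r_rejected) and the loss is its negative log. The function names are illustrative, and the scores are plain floats standing in for the output of a real reward model.

```python
import math


def preference_probability(score_chosen, score_rejected):
    """Modeled probability that the chosen response is preferred."""
    diff = score_chosen - score_rejected
    # Numerically stable sigmoid(diff).
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_loss(score_chosen, score_rejected):
    """Logistic loss -log(sigmoid(diff)), written as a stable softplus(-diff)."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Ranking the pair correctly costs less than ranking it the wrong way round.
print(pairwise_loss(2.0, 0.5) < pairwise_loss(0.5, 2.0))

# Shifting both scores by the same constant leaves the loss unchanged,
# which is why absolute score values carry no meaning on their own.
print(pairwise_loss(2.0, 0.5) == pairwise_loss(102.0, 100.5))
```

In actual training the two scores would come from the same network evaluated on (prompt, chosen) and (prompt, rejected), and this loss would be averaged over a batch of pairs and minimized by gradient descent.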

Why comparisons, not ratings

  • Labelers often disagree on numeric scores but agree more often on which of two answers is better.
  • Pairwise comparisons are easier to collect consistently and reduce labeler noise.

Failure modes

  • A reward model can be gamed: the policy may find responses that score high but are actually poor, a failure known as reward hacking.
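
The toy sketch below illustrates the idea under loud assumptions: proxy_reward is a deliberately flawed, hand-written stand-in for a learned reward model (it rewards length and a polite phrase rather than correctness), and the two candidate responses are invented for the example.

```python
def proxy_reward(response):
    """Hypothetical flawed reward: favors long, polite text, ignores correctness."""
    politeness_bonus = 1.0 if "happy to help" in response.lower() else 0.0
    return 0.01 * len(response) + politeness_bonus


candidates = {
    "correct": "2 + 2 = 4.",
    "padded": "I am happy to help! " + "Great question. " * 10 + "2 + 2 = 5.",
}

# A policy that simply maximizes the proxy picks the padded, wrong answer.
best = max(candidates, key=lambda name: proxy_reward(candidates[name]))
print(best)
```

The selected response scores highest under the proxy even though its arithmetic is wrong, which is the gap between measured reward and true quality that the policy can exploit.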

Key idea

A reward model learns a scalar quality score from pairwise human preferences using a logistic loss on chosen versus rejected responses, capturing relative ordering rather than absolute ratings.

Check yourself

1. Why are pairwise comparisons preferred over absolute numeric ratings?

2. What does the reward model output for a prompt and response?

3. What is reward hacking in this context?