Bias is learned, then amplified
Language models absorb the statistical regularities of their training text, including human social biases about gender, race, religion, and more. These can surface as stereotyped or unequal outputs.
How bias enters
- Data: web text overrepresents some groups and viewpoints and encodes historical prejudice.
- Objective: predicting likely text reproduces majority associations, so stereotypes that are common in data become likely outputs.
- Feedback: labelers and preference data carry their own biases into alignment.
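The objective point can be illustrated with a toy calculation. The sketch below is not a language model: it assumes a hypothetical corpus of (occupation, pronoun) pairs with a made-up 70/30 skew, estimates the conditional probabilities by counting, and then decodes greedily. Under those assumptions, the learned probabilities mirror the skew in the data, and always picking the most likely continuation turns that skew into a 100/0 split in the output.

```python
from collections import Counter

# Hypothetical toy corpus: (occupation, pronoun) pairs with an invented 70/30 skew.
corpus = [("nurse", "she")] * 70 + [("nurse", "he")] * 30

# Count which pronoun follows "nurse" in the toy data.
counts = Counter(pronoun for occupation, pronoun in corpus if occupation == "nurse")
total = sum(counts.values())

# Maximum-likelihood estimate of P(pronoun | "nurse") mirrors the data skew.
probs = {pronoun: n / total for pronoun, n in counts.items()}

# Greedy decoding always emits the single most likely continuation,
# so the 70/30 skew in the data becomes a 100/0 skew in the output.
greedy_outputs = [max(probs, key=probs.get) for _ in range(10)]
greedy_share = greedy_outputs.count("she") / len(greedy_outputs)

print(probs)
print(greedy_share)
```

Sampling from `probs` instead of taking the maximum would reproduce the skew rather than sharpen it, which is one reason the decoding strategy matters when measuring bias.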
How it surfaces
- Allocative harm: unequal quality of help, such as worse answers for prompts written in some dialects or mentioning some names.
- Representational harm: stereotyped or demeaning depictions of groups.
- Measurable disparities: refusal rates, sentiment, or answer quality that differ depending on which demographic group a prompt mentions.
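A disparity of this kind can be made concrete by computing a rate per group and comparing it with the pooled rate. The following is a minimal sketch, assuming a hypothetical list of evaluation records in which each entry names the group mentioned in the prompt and whether the model refused. The group labels, the records, and the `refusal_rates` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group mentioned in prompt, model refused?).
records = [
    ("group_a", False), ("group_a", False), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]

def refusal_rates(records):
    """Return the fraction of refused prompts for each group."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for group, refused in records:
        totals[group] += 1
        refusals[group] += int(refused)
    return {group: refusals[group] / totals[group] for group in totals}

rates = refusal_rates(records)

# Largest difference in refusal rate between any two groups.
gap = max(rates.values()) - min(rates.values())

# The pooled rate hides the difference between the groups.
overall = sum(refused for _, refused in records) / len(records)

print(rates, gap, overall)
```

In this invented data the pooled refusal rate sits between the two group rates, so a report that quoted only the pooled number would not reveal that one group is refused twice as often as the other.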
Measuring and mitigating
- Use counterfactual tests: swap a demographic term and check whether the output changes unfairly.
- Track disaggregated metrics per group rather than a single average.
- Mitigations include data curation, balanced fine-tuning, and targeted preference data, but no method fully removes bias.
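The counterfactual test described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not a standard tool: `swap_term`, `counterfactual_gap`, `toy_generate`, and `toy_score` are hypothetical names, and the toy model is deliberately biased so the sketch runs on its own and shows a nonzero gap. In practice `generate` would call a real model and `score` would be a task-appropriate metric such as sentiment or answer quality.

```python
import re

def swap_term(prompt, original, replacement):
    """Replace whole-word occurrences of `original` with `replacement`."""
    return re.sub(rf"\b{re.escape(original)}\b", replacement, prompt)

def counterfactual_gap(prompt, original, replacement, generate, score):
    """Score the output for a prompt and for its counterfactual twin.

    Returns score(counterfactual) - score(original); a value far from
    zero suggests the swapped term alone changed the output.
    """
    base = score(generate(prompt))
    swapped = score(generate(swap_term(prompt, original, replacement)))
    return swapped - base

# Hypothetical stand-ins for a model and a scorer (deliberately biased toy).
def toy_generate(prompt):
    return "strong candidate" if "John" in prompt else "adequate candidate"

def toy_score(text):
    return 1.0 if "strong" in text else 0.0

gap = counterfactual_gap(
    "Assess John as an engineer.", "John", "Maria", toy_generate, toy_score
)
print(gap)
```

A single prompt pair says little by itself. Running many templates and term pairs, then reporting the gaps per group rather than as one average, gives a more reliable picture.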
A caution
- Debiasing a model against one benchmark can mask bias elsewhere, so claims of an unbiased model should be treated skeptically.
Key idea
Language models inherit social bias from data, objective, and feedback, and it surfaces as allocative and representational harms. Because no single fix fully removes bias, counterfactual and disaggregated testing is needed.