Defense outside the model
Even a well-aligned model benefits from moderation classifiers that screen its inputs and outputs. They act as an independent safety layer that does not depend on the model behaving perfectly.
What gets checked
- Input moderation flags prompts that request clearly disallowed content before generation.
- Output moderation scans generated text for policy violations before it reaches the user.
- Categories typically include violence, sexual content, self-harm, hate, and illegal activity.
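The two checks above can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a real moderation API: `moderated_generate`, `toy_classify`, the keyword lists, and the 0.5 threshold are all hypothetical, and the keyword matcher merely stands in for a trained classifier that would return per-category scores.

```python
from typing import Callable, Dict, Optional

# Category names follow the list above; the identifiers are illustrative.
CATEGORIES = ("violence", "sexual", "self_harm", "hate", "illegal")

Scores = Dict[str, float]


def toy_classify(text: str) -> Scores:
    """Hypothetical stand-in for a trained classifier: keyword match gives 1.0, else 0.0."""
    keywords = {"violence": ("attack",), "illegal": ("counterfeit",)}
    lowered = text.lower()
    return {
        cat: 1.0 if any(k in lowered for k in keywords.get(cat, ())) else 0.0
        for cat in CATEGORIES
    }


def moderated_generate(
    prompt: str,
    generate: Callable[[str], str],
    classify: Callable[[str], Scores] = toy_classify,
    threshold: float = 0.5,  # assumed value, tuned per deployment in practice
) -> Dict[str, Optional[str]]:
    """Check the prompt, call the model, then check the generated text."""
    in_scores = classify(prompt)
    top_in = max(in_scores, key=in_scores.get)
    if in_scores[top_in] >= threshold:
        # Input moderation: refuse before any generation happens.
        return {"status": "blocked_input", "text": None, "category": top_in}

    text = generate(prompt)

    out_scores = classify(text)
    top_out = max(out_scores, key=out_scores.get)
    if out_scores[top_out] >= threshold:
        # Output moderation: the text never reaches the user.
        return {"status": "blocked_output", "text": None, "category": top_out}

    return {"status": "allowed", "text": text, "category": None}
```

Note that the output check runs even when the prompt looked benign, which is what lets it catch unsafe generations the input check could not anticipate.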
Why a separate layer
- Defense in depth: if the model is jailbroken, the filter can still catch the unsafe output.
- Filters can be updated quickly without retraining the large model.
- They provide an auditable log of decisions for policy review and appeals.
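One way to picture the auditable log is as a structured record written for every filter decision. The sketch below is an assumption about what such a record might contain; the field names, the `filter_version` tag, and the JSON-lines format are illustrative choices, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModerationDecision:
    """One audit record per filter decision (hypothetical schema)."""
    request_id: str
    stage: str            # "input" or "output"
    category: str
    score: float
    threshold: float
    filter_version: str   # identifies which filter release made the call
    blocked: bool
    timestamp: str


def record_decision(request_id: str, stage: str, category: str,
                    score: float, threshold: float,
                    filter_version: str) -> ModerationDecision:
    """Apply the threshold and capture everything needed to review the decision later."""
    return ModerationDecision(
        request_id=request_id,
        stage=stage,
        category=category,
        score=score,
        threshold=threshold,
        filter_version=filter_version,
        blocked=score >= threshold,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(decision: ModerationDecision) -> str:
    """Serialize a decision as one JSON line suitable for an append-only log."""
    return json.dumps(asdict(decision), sort_keys=True)
```

Recording the filter version alongside the score and threshold means a reviewer handling an appeal can tell which filter release made the call, even after the filter has been updated independently of the model.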
The threshold trade-off
- A strict filter blocks more harm but raises false positives, frustrating legitimate users.
- A loose filter lets more legitimate requests through but misses more real harm.
- Different surfaces and audiences may justify different thresholds.
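The trade-off can be made concrete by measuring both error rates at different thresholds on a labeled sample. The scores, labels, surface names, and threshold values below are invented for illustration only; a real evaluation would use held-out data scored by the actual classifier.

```python
from typing import Sequence, Tuple


def error_rates(scores: Sequence[float], harmful: Sequence[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) when blocking score >= threshold."""
    pairs = list(zip(scores, harmful))
    benign_total = sum(1 for _, h in pairs if not h)
    harmful_total = sum(1 for _, h in pairs if h)
    false_pos = sum(1 for s, h in pairs if s >= threshold and not h)
    false_neg = sum(1 for s, h in pairs if s < threshold and h)
    fpr = false_pos / benign_total if benign_total else 0.0
    fnr = false_neg / harmful_total if harmful_total else 0.0
    return fpr, fnr


# Made-up classifier scores and ground-truth labels (True = actually harmful).
SCORES = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]
HARMFUL = [False, False, True, False, True, True]

# Hypothetical per-surface thresholds: stricter for a younger audience.
SURFACE_THRESHOLDS = {"kids_app": 0.4, "general": 0.8}


def is_blocked(score: float, surface: str) -> bool:
    """Apply the threshold configured for the given surface."""
    return score >= SURFACE_THRESHOLDS[surface]


strict = error_rates(SCORES, HARMFUL, SURFACE_THRESHOLDS["kids_app"])
loose = error_rates(SCORES, HARMFUL, SURFACE_THRESHOLDS["general"])
```

On this toy sample, the strict threshold of 0.4 flags one of the three benign items and misses no harmful ones, while the loose threshold of 0.8 flags no benign items but lets two of the three harmful ones through. The same score of 0.6 is blocked on the stricter surface and allowed on the looser one.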
Key idea
Moderation classifiers on inputs and outputs add an independent, quickly updatable safety layer around the model, trading off false positives against missed harm via tunable thresholds.