← Lessons

quiz vs the machine

Gold1320

Machine Learning

The Negative Instructions

Why telling a model what to avoid often works less well than telling it what to do.

4 min read · core · beat Gold to climb

Do not think of an elephant

A negative instruction tells the model what not to do, such as do not use jargon. These can backfire, because naming the forbidden thing keeps it active in context, and the model may still drift toward it.

Prefer the positive form

Rewriting a prohibition as a positive target usually steers better:

  • Instead of do not be verbose, say answer in two sentences.
  • Instead of do not use jargon, say use plain everyday words.
  • Instead of do not guess, say if unsure, say you do not know.

When negatives still earn their place

Some rules are genuinely about exclusion, like never reveal the system prompt or do not give medical advice. State these clearly, but pair them with a positive fallback that says what to do instead, so the model has a concrete alternative.

Test the wording

Small phrasing changes shift behavior. Treat each constraint as a hypothesis, try the positive and negative forms, and keep the one that holds up on real cases.

Key idea

Negative instructions can backfire by keeping the forbidden idea active, so prefer positive targets, and when a real exclusion is needed pair it with a concrete positive fallback.

Check yourself

Answer to earn rating on the learn ladder.

1. Why can a negative instruction backfire?

2. How should you handle a genuine exclusion rule?