quiz vs the machine

Gold1470

Machine Learning

The Jailbreak and Prompt Injection Defense

How attackers bypass safety rules and what layered defenses help.

6 min read · core · beat Gold to climb

Two related attacks

A jailbreak crafts a prompt that tricks the model into ignoring its safety rules, for example role play or obfuscation.
A prompt injection hides malicious instructions inside content the model reads, such as a web page or document, hijacking its behavior.

Why they work

The model treats all text in its context similarly, so it struggles to separate trusted instructions from untrusted data.
Safety training is imperfect and attackers explore phrasings it never saw.

Layered defenses

Instruction hierarchy: train the model to trust the system prompt over user text and untrusted content.
Input and output filtering: classifiers scan for known attack patterns and unsafe outputs.
Privilege separation: keep dangerous tools behind explicit confirmations, so injected text cannot silently act.
Content provenance: clearly mark untrusted data and discount instructions found inside it.

A realistic stance

No single defense is complete, so defenses are stacked.
Treat the model like a confused deputy and never grant the raw output unchecked power over tools or data.

Key idea

Jailbreaks and prompt injections exploit the model's inability to separate trusted instructions from untrusted data, so defense layers an instruction hierarchy, filtering, and privilege separation rather than relying on any single barrier.

Check yourself

Answer to earn rating on the learn ladder.

1. How does prompt injection differ from a jailbreak?

2. What is the instruction hierarchy defense?

3. Why is privilege separation important against these attacks?