The threat
Prompt injection happens when untrusted content, such as a web page or a document, contains instructions that hijack the model away from its real task. Because the model treats all text as language, malicious instructions buried in data can be obeyed if you are not careful.
Why it is hard
- The model cannot reliably tell data apart from instructions on its own.
- Injected text can ask the model to ignore prior rules or leak secrets.
- Tool enabled agents can be steered to take harmful actions.
Layered defenses
- Separate roles so untrusted content is clearly marked as data, not commands.
- Least privilege for any tools, limiting what a hijacked model can do.
- Output validation before acting on a response, checking for unexpected commands.
- Human approval for sensitive operations like sending mail or deleting data.
A realistic stance
No single trick fully solves injection today. Treat model output as untrusted and design the surrounding system so a successful injection has limited blast radius. Defense in depth beats relying on the prompt alone.
Key idea
Prompt injection lets untrusted text hijack a model, and because no single defense is complete, layered controls and limited tool privileges keep the blast radius small.