← Lessons

quiz vs the machine

Gold1480

Machine Learning

Prompt Injection Defense Revisited

Guarding a model when untrusted text enters the prompt.

6 min read · core · beat Gold to climb

The threat

Prompt injection happens when untrusted content, such as a web page or a document, contains instructions that hijack the model away from its real task. Because the model treats all text as language, malicious instructions buried in data can be obeyed if you are not careful.

Why it is hard

  • The model cannot reliably tell data apart from instructions on its own.
  • Injected text can ask the model to ignore prior rules or leak secrets.
  • Tool enabled agents can be steered to take harmful actions.

Layered defenses

  • Separate roles so untrusted content is clearly marked as data, not commands.
  • Least privilege for any tools, limiting what a hijacked model can do.
  • Output validation before acting on a response, checking for unexpected commands.
  • Human approval for sensitive operations like sending mail or deleting data.

A realistic stance

No single trick fully solves injection today. Treat model output as untrusted and design the surrounding system so a successful injection has limited blast radius. Defense in depth beats relying on the prompt alone.

Key idea

Prompt injection lets untrusted text hijack a model, and because no single defense is complete, layered controls and limited tool privileges keep the blast radius small.

Check yourself

Answer to earn rating on the learn ladder.

1. What is prompt injection?

2. Which strategy best limits injection damage?