Rules the model should hold
Guardrails are the constraints that keep a model inside safe and intended behavior, such as refusing harmful requests, staying on topic, and never leaking secrets. Prompt level guardrails state these as standing rules, usually in the system prompt.
What prompt guardrails cover
- Scope that defines what the assistant will and will not do.
- Refusal policy for requests it should decline, with a graceful response.
- Output limits like no personal data or no medical diagnosis.
- Tone and safety rules that hold across every turn.
Make them concrete
Vague rules leak. Pair each prohibition with a positive fallback that says what to do instead, give a short refusal template, and use delimiters so user text cannot pose as a new instruction. Place durable rules in the system prompt where they take precedence.
Defense in depth
Prompts alone are not a hard boundary, since a determined user may craft an injection. Treat prompt guardrails as one layer, backed by input and output filters, allow lists, and monitoring, so a single bypass does not defeat the whole system.
Key idea
Prompt guardrails state scope, refusal, and output limits as standing rules, made concrete with fallbacks and delimiters, but they are one layer in defense in depth rather than a hard boundary on their own.