Prompts are code too
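A prompt that drives a production system is a fragile artifact: a small wording change can alter behavior without raising any error. Versioning and testing treat it like code, with a stored version, a test suite, and a record of which version ran, so a breaking tweak is caught before it ships rather than discovered afterward.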
What to track
- Version identity so each prompt change has a label you can roll back to.
- A test set of inputs with expected properties of the output.
- Metrics that turn each run into a pass or a graded score.
- Linkage in the logs that records which prompt version produced each output.
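As an illustration of the first and last items, here is a minimal sketch of version identity and log linkage. It assumes the version label is derived from a hash of the prompt text; the names `PromptVersion`, `version_id`, and `log_record` are hypothetical and not taken from any particular library.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    """A named prompt template (hypothetical structure for illustration)."""
    name: str
    template: str

    @property
    def version_id(self) -> str:
        # Content hash: any edit to the template yields a different label.
        digest = hashlib.sha256(self.template.encode("utf-8")).hexdigest()
        return f"{self.name}@{digest[:12]}"


def log_record(prompt: PromptVersion, user_input: str, output: str) -> str:
    """Serialize one call so the output can be traced back to its prompt version."""
    return json.dumps(
        {
            "prompt_version": prompt.version_id,
            "input": user_input,
            "output": output,
        },
        sort_keys=True,
    )


v1 = PromptVersion("summarize", "Summarize in one sentence: {text}")
v2 = PromptVersion("summarize", "Summarize in one short sentence: {text}")

# The output string here is a placeholder, not a real model response.
record = log_record(v1, "some text", "a summary")
```

Hashing the content is only one option. A counter or a git commit would also work, provided the label stored in the log maps back to exactly one prompt text.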
Test like software
Build a suite of representative and edge cases, run each new prompt against it, and compare its scores to the current version's before promoting. Catch regressions in the suite rather than through user complaints. Grade open-ended outputs with rules or with a grader model.
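The comparison step might look like the sketch below. It assumes rule-based checks and uses two stub functions in place of real model calls; `CASES`, `score`, `should_promote`, and both stubs are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

# Each case pairs an input with a rule that checks a property of the output.
CASES: List[Dict[str, object]] = [
    {"input": "The cat sat on the mat.", "check": lambda out: len(out.split()) <= 12},
    {"input": "", "check": lambda out: out.strip() != ""},  # edge case: empty input
    {"input": "Prices rose 3% in May.", "check": lambda out: "3%" in out},
]


def score(generate: Callable[[str], str], cases: List[Dict[str, object]]) -> float:
    """Fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case["check"](generate(case["input"])))
    return passed / len(cases)


def should_promote(current_score: float, candidate_score: float,
                   tolerance: float = 0.0) -> bool:
    """Promote only if the candidate does not score below the current version."""
    return candidate_score >= current_score - tolerance


# Stubs standing in for "model called with the current / candidate prompt".
def current_prompt(text: str) -> str:
    return text if text else "No content provided."


def candidate_prompt(text: str) -> str:
    # This variant paraphrases the figure, which the third check treats as a regression.
    return text.replace("3%", "a few percent") if text else "No content provided."


current_score = score(current_prompt, CASES)
candidate_score = score(candidate_prompt, CASES)
promote = should_promote(current_score, candidate_score)
```

With real models, each case would typically be run several times, since a single sample per case says little about a stochastic system.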
Roll out carefully
Because models are stochastic, a passing average can hide rare failures. Stage changes behind a flag, compare versions on live traffic, and keep the ability to revert instantly if a new prompt degrades a metric.
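One way to stage a change is a percentage flag with stable per-user bucketing, sketched below. The function names, the bucket count of 100, and the revert threshold are assumptions for illustration. A real system would read the flag value from its own configuration store.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, rollout_percent: int) -> str:
    """Send roughly rollout_percent of users to the candidate prompt."""
    return "candidate" if bucket(user_id) < rollout_percent else "current"


def should_revert(current_fail_rate: float, candidate_fail_rate: float,
                  max_increase: float = 0.01) -> bool:
    """Flag a revert when the candidate fails noticeably more often on live traffic."""
    return candidate_fail_rate - current_fail_rate > max_increase


# Setting rollout_percent to 0 is the instant revert: every user gets "current".
assignment = choose_version("user-1", rollout_percent=10)
```

Because the bucket depends only on the user id, a given user sees the same version across requests, which keeps the per-version metrics comparable.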
Key idea
Versioning and testing treat prompts as code, with labeled versions, a graded test suite, and logged linkage, so changes are compared and staged rather than shipped blind into stochastic production behavior.