The risk of a full swap
Replacing a serving model all at once is dangerous. A new version can be slower, more expensive, or quietly worse on real inputs in ways that offline tests miss. A canary deploy limits the blast radius of such a regression.
How a canary works
A small slice of live traffic, for example 1% or 5%, is routed to the new model while the rest stays on the stable one. The team watches metrics for the canary slice and compares them to the stable baseline.
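One way to carve out that slice is to hash a request or user identifier into a fixed number of buckets and send the lowest buckets to the canary. The sketch below assumes this approach; the names route and CANARY_FRACTION and the 5% default are illustrative, not taken from any particular serving stack.

```python
import hashlib

CANARY_FRACTION = 0.05  # assumed: 5% of traffic goes to the canary
NUM_BUCKETS = 10_000    # assumed resolution: 0.01% per bucket


def route(request_id: str, canary_fraction: float = CANARY_FRACTION) -> str:
    """Return "canary" or "stable" for a request, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, NUM_BUCKETS).
    bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    # The lowest buckets form the canary slice.
    return "canary" if bucket < canary_fraction * NUM_BUCKETS else "stable"
```

Hashing rather than drawing a random number keeps the assignment sticky: the same identifier always lands on the same model, which makes per-user comparisons cleaner. Whether to key on the request or the user is a design choice this sketch leaves open.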
What to compare
- Latency and cost, to catch performance regressions.
- Quality signals such as user feedback, click-through rate, or error rate.
- Output drift, where the new model's answers differ sharply from the stable model's on the same inputs.
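The comparisons in this list can be sketched as a small check that flags any metric where the canary is clearly worse than the baseline. Everything here is an assumption for illustration: the SliceStats fields, the 20% latency and cost tolerances, the one-point error-rate tolerance, and exact string match as the drift measure.

```python
from dataclasses import dataclass
from statistics import mean, quantiles
from typing import List, Sequence


@dataclass
class SliceStats:
    latencies_ms: Sequence[float]  # per-request latency samples
    costs_usd: Sequence[float]     # per-request cost samples
    errors: int                    # failed requests in the window
    requests: int                  # total requests in the window


def p95(values: Sequence[float]) -> float:
    # With n=20 there are 19 cut points; the last is the 95th percentile.
    return quantiles(values, n=20)[-1]


def find_regressions(stable: SliceStats, canary: SliceStats,
                     max_latency_ratio: float = 1.2,
                     max_cost_ratio: float = 1.2,
                     max_error_rate_increase: float = 0.01) -> List[str]:
    """List the metrics on which the canary exceeds its tolerance."""
    problems = []
    if p95(canary.latencies_ms) > max_latency_ratio * p95(stable.latencies_ms):
        problems.append("p95 latency")
    if mean(canary.costs_usd) > max_cost_ratio * mean(stable.costs_usd):
        problems.append("mean cost")
    stable_rate = stable.errors / stable.requests
    canary_rate = canary.errors / canary.requests
    if canary_rate - stable_rate > max_error_rate_increase:
        problems.append("error rate")
    return problems


def disagreement_rate(stable_outputs: Sequence[str],
                      canary_outputs: Sequence[str]) -> float:
    """Fraction of paired inputs on which the two models answer differently."""
    pairs = list(zip(stable_outputs, canary_outputs))
    return sum(a.strip() != b.strip() for a, b in pairs) / len(pairs)
```

An empty list from find_regressions means no tolerance was exceeded. It does not prove the canary is healthy, because a 1% slice produces few samples and small differences can be noise. Exact match is also a crude drift signal for free-text outputs, where a similarity score would usually replace it.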
Promote or roll back
If the canary looks healthy, traffic is shifted gradually until the new model serves everyone. If metrics worsen, traffic is pulled back to the stable model immediately. The small initial slice means few users ever see a bad version.
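The promote-or-roll-back rule can be sketched as a step function over a ramp schedule. The schedule in RAMP_STEPS and the single healthy flag are assumptions made for the example. In practice the flag would come from a metric comparison, and each step would be held long enough to collect meaningful data.

```python
from typing import Iterable, List, Sequence

RAMP_STEPS = (0.01, 0.05, 0.25, 0.50, 1.00)  # assumed traffic schedule


def next_fraction(current: float, healthy: bool,
                  steps: Sequence[float] = RAMP_STEPS) -> float:
    """Return the canary traffic fraction for the next stage."""
    if not healthy:
        return 0.0  # roll back: all traffic returns to the stable model
    for step in steps:
        if step > current:
            return step  # promote to the next larger slice
    return steps[-1]  # already at the final step


def simulate(health_checks: Iterable[bool]) -> List[float]:
    """Replay a sequence of health verdicts and record the traffic fractions."""
    fraction = RAMP_STEPS[0]
    history = [fraction]
    for healthy in health_checks:
        fraction = next_fraction(fraction, healthy)
        history.append(fraction)
        if fraction in (0.0, 1.0):
            break  # rolled back or fully promoted
    return history
```

The asymmetry is deliberate: promotion moves one step at a time, while a single unhealthy verdict drops the canary to zero in one move.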
Key idea
A canary deploy routes a small slice of live traffic to a new model and compares its latency, cost, and quality against the stable baseline. Healthy canaries are promoted gradually and bad ones rolled back fast, so few users ever meet a flawed version.