Assume the model will fail
Dependencies time out, models throw errors, and features go missing. A resilient system degrades gracefully instead of breaking.
Layers of fallback
Order them from best to safest.
- Primary model: the full-quality prediction
- Simpler model: a fast cached or lightweight backup
- Heuristic: a rule-based default, such as most popular items
- Static default: a safe constant response
Timeouts and circuit breakers
- Set a timeout so a slow model never blocks the request
- A circuit breaker stops calling a failing model and serves the fallback
- Default features fill in when a feature lookup fails
Fail loud internally
Degrade gracefully for the user but emit a metric or alert so engineers know the fallback fired and why.
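The ladder, timeout, circuit breaker, default features and internal metric described above can be combined in one request path. This is an added sketch, not code from the original page; `Breaker`, `predict`, `METRICS` and `DEFAULT_FEATURES` are hypothetical names, and the `Counter` stands in for a real metrics or alerting client.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

METRICS = Counter()   # assumption: stand-in for a real metrics/alerting client
POOL = ThreadPoolExecutor(max_workers=4)
DEFAULT_FEATURES = {"age_bucket": "unknown", "region": "global"}  # constructed

class Breaker:
    """Opens after `threshold` consecutive failures, probes again after `cooldown` s."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: allow one probe; a single failure re-opens the circuit.
            self.opened_at, self.failures = None, self.threshold - 1
            return True
        return False

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def predict(features, layers, static_default, timeout=0.2):
    """Walk (name, fn, breaker) layers from best to safest; return (layer, result)."""
    # Default features fill in for missing or failed lookups (None values).
    present = {k: v for k, v in features.items() if v is not None}
    feats = {**DEFAULT_FEATURES, **present}
    for name, fn, breaker in layers:
        if not breaker.allow():
            METRICS[(name, "circuit_open")] += 1   # fail loud: breaker skipped layer
            continue
        try:
            # result(timeout) stops waiting; the worker thread runs until fn returns.
            result = POOL.submit(fn, feats).result(timeout=timeout)
            breaker.record(True)
            return name, result
        except Exception as exc:                   # includes TimeoutError
            breaker.record(False)
            METRICS[(name, type(exc).__name__)] += 1  # fail loud: record why
    METRICS[("static_default", "served")] += 1
    return "static_default", static_default
```

Each layer gets its own breaker so a failing primary model does not take the simpler model out of rotation with it.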
Key idea
Design fallbacks as a ladder from best to safest, protected by timeouts and circuit breakers, so users stay served while engineers get alerted.