From artifact to service
A trained model is a file. Serving wraps it in an API, scales it, and keeps it healthy.
Core components
- Model server loads the artifact and exposes a predict endpoint
- Load balancer spreads requests across replicas
- Autoscaler adds replicas under load and removes them when idle
- Model registry versions artifacts and tracks which is live
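As a rough illustration of how the registry and the model server fit together, here is a minimal in-memory sketch. The names `ModelRegistry`, `ModelServer`, `register`, and `promote` are hypothetical and not taken from any particular serving framework, and a real registry would persist artifacts rather than hold Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumption: a "model" is any callable from a feature vector to a score.
Model = Callable[[List[float]], float]


@dataclass
class ModelRegistry:
    """Toy registry: stores versioned models and tracks which one is live."""
    _versions: Dict[str, Model] = field(default_factory=dict)
    _live: Optional[str] = None

    def register(self, version: str, model: Model) -> None:
        # Versions are immutable: re-registering the same tag is an error.
        if version in self._versions:
            raise ValueError(f"version {version!r} already registered")
        self._versions[version] = model

    def promote(self, version: str) -> None:
        # Mark an already-registered version as the live one.
        if version not in self._versions:
            raise KeyError(version)
        self._live = version

    @property
    def live_version(self) -> Optional[str]:
        return self._live

    def live_model(self) -> Model:
        if self._live is None:
            raise RuntimeError("no live model has been promoted")
        return self._versions[self._live]


class ModelServer:
    """Toy server: answers predict calls with whatever version is live."""

    def __init__(self, registry: ModelRegistry) -> None:
        self._registry = registry

    def predict(self, features: List[float]) -> dict:
        model = self._registry.live_model()
        return {
            "version": self._registry.live_version,
            "prediction": model(features),
        }
```

In a real deployment the `predict` method would sit behind an HTTP or RPC endpoint, with the load balancer and autoscaler operating on replicas of this server.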
Deployment patterns
- Embedded: model runs inside the application process, lowest latency
- Service: a separate prediction microservice, easier to scale and update
- Edge: model runs on device, no network round trip
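One way to picture the difference between these patterns is a shared interface with two implementations. The sketch below is an assumption-laden illustration: `Predictor`, `EmbeddedPredictor`, `ServicePredictor`, and the `send` transport are all hypothetical names, and the JSON request shape is invented for the example. The edge pattern would look like the embedded one, just running on the device.

```python
import json
from typing import Callable, List, Protocol


class Predictor(Protocol):
    """What the application codes against, regardless of pattern."""

    def predict(self, features: List[float]) -> float: ...


class EmbeddedPredictor:
    """Embedded pattern: the model is called directly, in-process."""

    def __init__(self, model: Callable[[List[float]], float]) -> None:
        self._model = model

    def predict(self, features: List[float]) -> float:
        return self._model(features)


class ServicePredictor:
    """Service pattern: features are serialized and sent to a separate service.

    `send` is a hypothetical transport (for example an HTTP POST) that takes
    request bytes and returns response bytes.
    """

    def __init__(self, send: Callable[[bytes], bytes]) -> None:
        self._send = send

    def predict(self, features: List[float]) -> float:
        request = json.dumps({"features": features}).encode("utf-8")
        response = self._send(request)
        return float(json.loads(response)["prediction"])


def fake_service(request: bytes) -> bytes:
    """Stand-in for the remote prediction service, used only for the demo."""
    features = json.loads(request)["features"]
    return json.dumps({"prediction": sum(features)}).encode("utf-8")
```

Because both classes satisfy `Predictor`, the application can switch from `EmbeddedPredictor(sum)` to `ServicePredictor(fake_service)` without changing its calling code; only the serialization and network cost differ.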
Safe rollouts
- Canary: send a small slice of traffic to the new model first
- Shadow: run the new model alongside the old without serving its output
- Rollback: keep the previous version one click away
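The three rollout techniques above can be sketched in a single small router. This is a minimal illustration under stated assumptions, not a production design: `Rollout`, `in_canary`, and the hash-bucket scheme are hypothetical choices, and the shadow comparison is logged to an in-memory list purely for demonstration.

```python
import hashlib
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable from a feature vector to a score.
Model = Callable[[List[float]], float]

BUCKETS = 10_000  # resolution of the traffic split (0.01% steps)


def in_canary(request_id: str, canary_fraction: float) -> bool:
    """Deterministically assign a request to the canary slice.

    Hashing the id keeps the same caller on the same model across requests.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < canary_fraction * BUCKETS


class Rollout:
    def __init__(self, stable: Model, candidate: Model,
                 canary_fraction: float = 0.0, shadow: bool = False) -> None:
        self.stable = stable
        self.candidate = candidate
        self.canary_fraction = canary_fraction
        self.shadow = shadow
        # (request_id, stable_output, candidate_output) for offline comparison
        self.shadow_log: List[Tuple[str, float, float]] = []

    def predict(self, request_id: str, features: List[float]) -> float:
        # Canary: a small slice of traffic is actually served by the candidate.
        if in_canary(request_id, self.canary_fraction):
            return self.candidate(features)
        result = self.stable(features)
        if self.shadow:
            # Shadow: the candidate runs, but its output is only recorded.
            try:
                self.shadow_log.append(
                    (request_id, result, self.candidate(features)))
            except Exception:
                pass  # a failing shadow model must never affect the caller
        return result

    def rollback(self) -> None:
        # Rollback: one call sends all traffic back to the stable version.
        self.canary_fraction = 0.0
        self.shadow = False
```

A typical progression would start with `shadow=True` and `canary_fraction=0.0`, raise the fraction in small steps while metrics stay healthy, and call `rollback()` if they do not.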
Decoupling the model service from the app lets you update the model without redeploying everything.
Key idea
Serving turns a file into a versioned, autoscaling service with safe rollouts and instant rollback.