From artifact to service
A trained model is just a file of weights. Serving is the work of wrapping that file behind an interface so applications can send inputs and get predictions back over the network.
Common patterns
- Embedded serving runs the model inside the application process. It is simple and avoids a network hop, but it ties the model's releases and resources to the application's.
- Dedicated-service serving runs the model behind its own API. The app calls it over HTTP or RPC, so the two can be scaled and deployed independently.
- Batch serving runs predictions offline on stored data and writes results to a table for later lookup.
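To make the three patterns concrete, here is a minimal sketch that uses a toy linear scorer in place of a real model. Every name in it (`predict`, `app_embedded`, `model_service`, `app_remote`, `batch_score`) is hypothetical, and the `send` argument is an assumed stand-in for whatever HTTP or RPC client a real application would use.

```python
import json
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained model: fixed weights and a bias.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features: List[float]) -> float:
    """Toy model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


# 1. Embedded: the application calls the model as an ordinary function
#    inside its own process.
def app_embedded(features: List[float]) -> float:
    return predict(features)


# 2. Dedicated service: the model sits behind a request/response boundary.
#    model_service plays the role of the server-side handler.
def model_service(request_body: str) -> str:
    payload = json.loads(request_body)
    return json.dumps({"prediction": predict(payload["features"])})


def app_remote(features: List[float], send: Callable[[str], str]) -> float:
    """The app only knows the wire format and a transport (assumed: send)."""
    response = send(json.dumps({"features": features}))
    return json.loads(response)["prediction"]


# 3. Batch: score stored rows ahead of time and keep the results in a
#    table (here a dict) for later lookup by key.
def batch_score(rows: Dict[str, List[float]]) -> Dict[str, float]:
    return {key: predict(features) for key, features in rows.items()}
```

Passing `model_service` directly as `send` keeps the sketch runnable in one process. In a real deployment that call would cross the network, which is what lets the model and the application live on different machines.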
The serving server
A serving server loads the weights into memory once, then handles many requests against that warm copy. It manages threads, queues, and hardware so that individual requests do not pay the loading cost again.
Why separate the model
- The model can use GPUs while the app runs on cheap CPUs.
- Teams can update the model without redeploying the whole app.
- Several apps can share one model endpoint.
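The last two points can be illustrated with a small in-process sketch. `ModelEndpoint`, `search_app`, and `feed_app` are hypothetical names, and a real endpoint would sit behind a network API rather than be a Python object; the point is only that the callers depend on the endpoint, not on a specific model.

```python
import threading
from typing import Callable, List

Model = Callable[[List[float]], float]


class ModelEndpoint:
    """Hypothetical stand-in for a shared, independently updated endpoint."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self.version = version

    def deploy(self, model: Model, version: str) -> None:
        # Swap the model in place; callers keep using the same endpoint.
        with self._lock:
            self._model = model
            self.version = version

    def predict(self, features: List[float]) -> float:
        # Take a reference under the lock, then run the model outside it.
        with self._lock:
            model = self._model
        return model(features)


# Two separate "applications" that share one endpoint.
def search_app(endpoint: ModelEndpoint, query: List[float]) -> float:
    return endpoint.predict(query)


def feed_app(endpoint: ModelEndpoint, items: List[List[float]]) -> List[List[float]]:
    # Rank items by model score, highest first.
    return sorted(items, key=endpoint.predict, reverse=True)
```

Calling `endpoint.deploy(new_model, "v2")` changes what both applications get back without touching or redeploying either of them.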
Key idea
Serving turns a static weights file into a live endpoint. Choosing embedded, dedicated, or batch serving is about how tightly the model should be coupled to the application and whether the two need to scale independently.