Model Serving Architectures

How a trained model becomes a service that answers requests.

4 min read · intro

From artifact to service

A trained model is just a file of weights. Serving is the work of wrapping that file behind an interface so applications can send inputs and get predictions back, usually over the network.

Common patterns

  • Embedded serving runs the model inside the application process. It is simple and fast because there is no network hop, but it ties the model and the app together: they ship and scale as one unit.
  • Dedicated-service serving runs the model behind its own API. The app calls it over HTTP or RPC, so the two scale and deploy independently.
  • Batch serving runs predictions offline on stored data and writes results to a table for later lookup.
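
To make the three patterns concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than any particular framework's API: the toy dot-product model, the hard-coded `WEIGHTS`, and the function names are all hypothetical. The network hop in the dedicated pattern is stood in for by a plain function call that exchanges JSON strings.

```python
import json
from typing import Callable

# Hypothetical toy model: a fixed weight vector scored with a dot product.
WEIGHTS = [0.5, -1.0, 2.0]


def predict(features: list[float]) -> float:
    """Score one input against the in-memory weights."""
    return sum(w * x for w, x in zip(WEIGHTS, features))


# Embedded: the app calls the model directly, in the same process.
def app_embedded(features: list[float]) -> float:
    return predict(features)


# Dedicated service: the model sits behind its own request/response API.
def model_service_handle(request_body: str) -> str:
    """What the model service would do with one serialized request."""
    features = json.loads(request_body)["features"]
    return json.dumps({"prediction": predict(features)})


def app_dedicated(
    features: list[float],
    send: Callable[[str], str] = model_service_handle,
) -> float:
    """The app only knows the wire format; `send` stands in for HTTP or RPC."""
    response = send(json.dumps({"features": features}))
    return json.loads(response)["prediction"]


# Batch: score stored rows offline and keep the results for later lookup.
def run_batch_job(rows: dict[str, list[float]]) -> dict[str, float]:
    """Return a key -> prediction table (a dict stands in for a database table)."""
    return {key: predict(features) for key, features in rows.items()}
```

In the embedded and dedicated versions the app gets the same prediction; what changes is whether it depends on the model code itself or only on the request format. In the batch version the app never calls the model at all and simply reads from the precomputed table.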

The serving server

A serving server loads the weights into memory once at startup, then handles many requests against that warm copy, so no request pays the load cost. It also manages threads, queues, and hardware to keep concurrent requests moving.

Why separate the model

  • The model can use GPUs while the app runs on cheap CPUs.
  • Teams can update the model without redeploying the whole app.
  • Several apps can share one model endpoint.
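
Two of these benefits, shared endpoints and independent model updates, can be illustrated with a small sketch. The `ModelEndpoint` and `App` classes and their methods are hypothetical, and a real deployment would put a network API between them. The point is only that the apps hold a reference to the endpoint, not to the weights.

```python
import threading


class ModelEndpoint:
    """A stable endpoint whose model version can be swapped underneath it."""

    def __init__(self, version: str, weights: list[float]) -> None:
        self._lock = threading.Lock()
        self._version = version
        self._weights = weights

    def predict(self, features: list[float]) -> dict:
        # Read version and weights together so a swap cannot split them.
        with self._lock:
            version, weights = self._version, self._weights
        score = sum(w * x for w, x in zip(weights, features))
        return {"version": version, "prediction": score}

    def deploy(self, version: str, weights: list[float]) -> None:
        """Replace the model without touching any caller."""
        with self._lock:
            self._version, self._weights = version, weights


class App:
    """An application that depends only on the endpoint's interface."""

    def __init__(self, name: str, endpoint: ModelEndpoint) -> None:
        self.name = name
        self.endpoint = endpoint

    def score(self, features: list[float]) -> float:
        return self.endpoint.predict(features)["prediction"]
```

If two `App` instances are built around the same `ModelEndpoint`, a single call to `deploy` changes the predictions both of them receive, and neither app is rebuilt or restarted.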

Key idea

Serving turns a static weights file into a live endpoint. Choosing embedded, dedicated, or batch serving is about how tightly the model should be coupled to the application and how it scales.

Check yourself

1. What is the main benefit of a dedicated model service over embedded serving?

2. Why does a serving server load weights only once at startup?