What it does
An inference server is a long-running service that loads a model into memory once and answers prediction requests over an API. It turns a static model artifact into a live endpoint.
Core responsibilities
- Load the model at startup and keep it warm in memory.
- Preprocess incoming requests into model inputs.
- Run the model and postprocess its output into a response.
- Expose health and metrics endpoints for operations.
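The responsibilities above can be sketched in a few lines. This is a minimal illustration, not a production server: `ToyModel`, `InferenceServer`, the `/predict`, `/health`, and `/metrics` paths, and the JSON payload shape are all assumptions made for the example, and the HTTP layer is replaced by a plain `handle` method so the flow stays visible.

```python
import json


class ToyModel:
    """Hypothetical stand-in for a real model: it sums its input features."""

    def predict(self, features):
        return sum(features)


class InferenceServer:
    def __init__(self):
        # Load the model once at startup; every request reuses this object.
        self.model = ToyModel()
        self.request_count = 0

    def preprocess(self, body):
        # Turn the raw request body into model inputs.
        payload = json.loads(body)
        return [float(x) for x in payload["features"]]

    def postprocess(self, raw_output):
        # Turn the raw model output into a response body.
        return json.dumps({"prediction": raw_output})

    def handle(self, path, body=""):
        # Route a request; returns (status_code, response_body).
        if path == "/health":
            return 200, json.dumps({"status": "ok"})
        if path == "/metrics":
            return 200, json.dumps({"requests": self.request_count})
        if path == "/predict":
            self.request_count += 1
            inputs = self.preprocess(body)
            return 200, self.postprocess(self.model.predict(inputs))
        return 404, json.dumps({"error": "not found"})


server = InferenceServer()
status, response = server.handle("/predict", '{"features": [1, 2, 3]}')
```

In a real deployment the `handle` method would sit behind a web framework or RPC layer, and the model load in `__init__` would read weights from disk.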
Throughput techniques
- Batching groups concurrent requests so the model runs once on many inputs, raising throughput.
- Concurrency serves multiple requests at the same time using worker processes or async handlers.
- Caching returns stored results for repeated identical inputs.
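A rough sketch of two of these techniques, batching and caching, follows. The names `predict_batch`, `serve_batched`, and `predict_cached` are hypothetical, and the "model" just doubles its inputs; the call counter exists only to show how many times the model actually runs.

```python
from functools import lru_cache

# Counts model invocations so the effect of each technique is visible.
calls = {"model": 0}


def predict_batch(batch):
    """Hypothetical model call: one invocation scores a whole batch."""
    calls["model"] += 1
    return [x * 2 for x in batch]


def serve_batched(inputs, max_batch_size):
    # Split pending inputs into chunks and run the model once per chunk.
    results = []
    for start in range(0, len(inputs), max_batch_size):
        results.extend(predict_batch(inputs[start:start + max_batch_size]))
    return results


@lru_cache(maxsize=1024)
def predict_cached(x):
    # Identical inputs after the first are answered from the cache.
    return predict_batch([x])[0]


outputs = serve_batched([1, 2, 3, 4, 5], max_batch_size=4)
calls_after_batching = calls["model"]  # five inputs, two model calls

first = predict_cached(7)
second = predict_cached(7)  # cache hit: the model is not called again
calls_after_caching = calls["model"]
```

Caching like this only makes sense when the model is deterministic and inputs are hashable; concurrency is left out here because it depends heavily on the serving framework.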
Latency versus throughput
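Batching trades a little latency for much higher throughput: a request may have to wait for its batch to fill before the model runs. You tune the maximum batch size and the wait window so that this added delay stays within the latency budget while the server still keeps up with the load.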
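One way to picture the two tuning knobs is a batch collector that stops at whichever limit is reached first. This is a sketch under assumptions: `collect_batch`, `max_batch_size`, and `max_wait_s` are illustrative names, and a standard-library `queue.Queue` stands in for the server's request queue.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, max_wait_s):
    """Gather requests until the batch is full or the wait window closes."""
    # Block until at least one request arrives; the window starts then.
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed: send a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived within the window
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

# Four requests are already waiting, so the batch fills immediately.
full_batch = collect_batch(pending, max_batch_size=4, max_wait_s=0.05)
# Only one request is left, so the collector waits out the window.
partial_batch = collect_batch(pending, max_batch_size=4, max_wait_s=0.05)
```

A larger `max_wait_s` gives batches more time to fill, which helps throughput under light load but adds up to that much delay to the first request in each batch.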
Key idea
An inference server loads a model once and serves predictions over an API, using batching, concurrency, and caching to balance latency against throughput.