Machine Learning

The Inference Server

The service that loads a model and answers prediction requests.

What it does

An inference server is a long-running service that loads a model into memory once and answers prediction requests over an API. It turns a static model artifact into a live endpoint.

Core responsibilities

  • Load the model at startup and keep it warm in memory.
  • Preprocess incoming requests into model inputs.
  • Run the model on those inputs and postprocess its output into a response.
  • Expose health and metrics endpoints for operations.
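
The four responsibilities above can be sketched in a few lines of Python. This is a minimal illustration, not a production server: `ToyModel`, `load_model`, `InferenceServer`, and the handler names are all hypothetical, the "model" is a fixed linear scorer, and the HTTP layer is left out so that the handlers are plain methods taking and returning JSON strings.

```python
import json
import time


class ToyModel:
    """Hypothetical stand-in for a real model: a fixed linear scorer."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


def load_model():
    # A real server would read weights from disk here; that is the slow step
    # worth doing only once.
    return ToyModel(weights=[0.5, -1.0, 2.0], bias=0.1)


class InferenceServer:
    def __init__(self):
        self.model = load_model()  # loaded once at startup, then kept in memory
        self.started_at = time.time()
        self.request_count = 0

    def preprocess(self, payload):
        # Turn the request body into model inputs.
        return [float(x) for x in payload["features"]]

    def postprocess(self, score):
        # Turn the raw model output into a response body.
        label = "positive" if score > 0 else "negative"
        return {"score": round(score, 4), "label": label}

    def handle_predict(self, body):
        features = self.preprocess(json.loads(body))
        score = self.model.predict(features)
        self.request_count += 1
        return json.dumps(self.postprocess(score))

    def handle_health(self):
        return json.dumps({"status": "ok"})

    def handle_metrics(self):
        uptime = time.time() - self.started_at
        return json.dumps({"requests_total": self.request_count,
                           "uptime_seconds": uptime})


server = InferenceServer()
response = server.handle_predict('{"features": [1, 2, 3]}')
```

In a real deployment each handler would sit behind a route of a web framework, and the model would be constructed once per worker process rather than once per request.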

Throughput techniques

  • Batching groups concurrent requests so the model runs once on many inputs, raising throughput.
  • Concurrency handles multiple requests with worker processes or async handlers.
  • Caching returns stored results for repeated identical inputs.
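
As a rough sketch of the last two techniques, the snippet below uses Python's standard `functools.lru_cache` for caching and a `ThreadPoolExecutor` for concurrency. The names `run_model`, `cached_predict`, and `model_calls` are made up for illustration, the model is a trivial sum, and a thread pool stands in for the worker processes or async handlers a real server would more likely use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

_calls_lock = threading.Lock()
model_calls = 0  # how many times the (hypothetical) model actually ran


def run_model(features):
    global model_calls
    with _calls_lock:
        model_calls += 1
    return sum(features)  # placeholder for real inference


@lru_cache(maxsize=1024)
def cached_predict(features):
    # lru_cache needs hashable arguments, so inputs are passed as tuples.
    return run_model(features)


# Caching: the second identical request is answered from the cache.
first = cached_predict((1.0, 2.0, 3.0))
second = cached_predict((1.0, 2.0, 3.0))
calls_after_repeat = model_calls

# Concurrency: several distinct requests are handled by a pool of workers.
requests = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cached_predict, requests))
```

Caching only pays off when identical inputs actually recur and the model is deterministic; otherwise the cache adds memory use without saving any model calls.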

Latency versus throughput

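Batching trades a little latency for much higher throughput, because a request may wait for its batch to fill before the model runs. You tune the maximum batch size and the wait window so that the added delay stays within your latency budget while the server still keeps up with the load.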
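
To make the two tuning knobs concrete, here is one possible sketch of a micro-batcher built on the standard library. `MicroBatcher`, `max_batch_size`, `max_wait_s`, and `double_all` are hypothetical names chosen for this example; a single background thread collects requests until the batch is full or the wait window expires, then runs the batch function once.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self.batch_fn = batch_fn              # runs the model on a list of inputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.batch_sizes = []                 # recorded for inspection only
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item):
        # Each caller gets a Future that resolves when its batch has run.
        future = Future()
        self._queue.put((item, future))
        return future

    def _run(self):
        while True:
            batch = [self._queue.get()]       # block until a first request arrives
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                     # wait window expired
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                outputs = self.batch_fn([item for item, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
                continue
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def double_all(items):
    # Placeholder for a model that processes a whole batch in one call.
    return [2 * x for x in items]


batcher = MicroBatcher(double_all, max_batch_size=4, max_wait_s=0.05)
futures = [batcher.submit(x) for x in range(10)]
results = [f.result(timeout=2) for f in futures]
```

A larger `max_wait_s` gives batches more time to fill, which raises throughput under load, but it is also the worst-case delay a lone request pays before the model even starts. The exact batch sizes recorded in `batcher.batch_sizes` depend on thread timing.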

Key idea

An inference server loads a model once and serves predictions over an API, using batching, concurrency, and caching to balance latency against throughput.

Check yourself

1. Why does an inference server load the model at startup rather than per request?

2. What does request batching primarily improve, and at what cost?