Batch versus Real-Time Inference

Two serving modes

You can compute predictions ahead of time or at request time. The right choice depends on freshness needs and the input space.

Batch inference

Compute predictions on a schedule and store them.

Good when: inputs are known and change slowly, such as daily user scores
Pros: simple serving, predictable cost, no latency pressure
Cons: stale between runs, wasteful if most predictions go unused
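
As a rough illustration of the batch pattern, the sketch below scores every known user in one scheduled pass and then serves by plain lookup. The `score` function, the feature values, and the in-memory dict standing in for a prediction store are all hypothetical placeholders, not part of any particular system.

```python
from typing import Dict, List


def score(features: List[float]) -> float:
    """Hypothetical model: a fixed weighted sum standing in for a real one."""
    weights = [0.5, 0.25, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def run_batch_job(user_features: Dict[str, List[float]]) -> Dict[str, float]:
    """Scheduled job: score every known user and return the table to store."""
    return {user_id: score(f) for user_id, f in user_features.items()}


def serve(store: Dict[str, float], user_id: str, default: float = 0.0) -> float:
    """Serving is a lookup; users missing from the last run get a default."""
    return store.get(user_id, default)


# Assumed example inputs; a dict stands in for a real key-value store.
user_features = {"alice": [1.0, 2.0, 4.0], "bob": [0.0, 4.0, 0.0]}
prediction_store = run_batch_job(user_features)

print(serve(prediction_store, "alice"))
print(serve(prediction_store, "carol"))  # not in the last run, so the default
```

Note that every user is scored whether or not anyone later asks for the result, and a user who appears after the run only gets the default until the next run.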

Real-time inference

Compute on demand when the request arrives.

Good when: inputs are fresh or the input space is huge, such as search
Pros: always current, only computes what is needed
Cons: tight latency budget, harder infrastructure
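
A minimal sketch of the on-demand pattern follows. The model runs inside the request handler, and the handler measures its own elapsed time against a latency budget. The `score` function and the 50 ms budget are assumptions chosen for illustration only.

```python
import time
from typing import List, Tuple

LATENCY_BUDGET_MS = 50.0  # assumed budget, purely illustrative


def score(features: List[float]) -> float:
    """Hypothetical model: a fixed weighted sum standing in for a real one."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_request(features: List[float]) -> Tuple[float, float, bool]:
    """Compute the prediction when the request arrives and time the call."""
    start = time.perf_counter()
    prediction = score(features)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    within_budget = elapsed_ms <= LATENCY_BUDGET_MS
    return prediction, elapsed_ms, within_budget


prediction, elapsed_ms, within_budget = handle_request([2.0, 4.0])
print(prediction, within_budget)
```

Nothing is computed until a request needs it, but every millisecond the model spends now sits on the request path.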

Hybrid

Many systems precompute heavy embeddings in batch, then run a light real-time pass that combines them with fresh context. This captures most of the freshness benefit at a fraction of the cost of running the full model on every request.
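
The hybrid split can be sketched as follows, assuming a hypothetical `heavy_embed` encoder that runs offline and a cheap dot product at request time. The item names, vectors, and scoring rule are illustrative assumptions, not a prescribed design.

```python
from typing import Dict, List


def heavy_embed(raw: List[float]) -> List[float]:
    """Stand-in for an expensive encoder that is only run in the batch job."""
    return [2.0 * x for x in raw]


def build_embedding_table(items: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Batch step: embed every item once and keep the results for serving."""
    return {item_id: heavy_embed(raw) for item_id, raw in items.items()}


def rank_items(context: List[float], table: Dict[str, List[float]]) -> List[str]:
    """Real-time step: a dot product between fresh context and stored embeddings."""
    def dot(a: List[float], b: List[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sorted(table, key=lambda item_id: dot(context, table[item_id]), reverse=True)


# Assumed example data.
items = {"news": [1.0, 0.0], "sports": [0.0, 1.0], "mixed": [0.5, 0.5]}
embedding_table = build_embedding_table(items)  # runs on a schedule

fresh_context = [0.25, 1.0]  # e.g. derived from the current session
print(rank_items(fresh_context, embedding_table))
```

The expensive call happens once per item per batch run, while the per-request work is limited to a few multiplications and a sort.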

Key idea