Two serving styles
Batch inference computes predictions for many items on a schedule and stores the results. When a request arrives, you look up the precomputed answer. Real-time inference computes a prediction on demand for each request as it arrives.
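A minimal sketch of the two styles, assuming a toy scoring function in place of a real model and a plain dictionary in place of a prediction store. The names toy_model, run_batch_job, serve_batch, and serve_real_time are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Iterable


def toy_model(item_id: str) -> float:
    """Hypothetical stand-in for a trained model: scores an item by name length."""
    return len(item_id) / 10.0


def run_batch_job(model: Callable[[str], float], item_ids: Iterable[str]) -> Dict[str, float]:
    """Scheduled job: score every known item ahead of time and store the results."""
    return {item_id: model(item_id) for item_id in item_ids}


def serve_batch(store: Dict[str, float], item_id: str) -> float:
    """Batch serving: a key lookup, with no model call at request time."""
    return store[item_id]


def serve_real_time(model: Callable[[str], float], item_id: str) -> float:
    """Real-time serving: call the model for each request as it arrives."""
    return model(item_id)


# The batch job runs before any request and only covers items known in advance.
store = run_batch_job(toy_model, ["apple", "kiwi"])
```

Here serve_batch(store, "apple") returns the stored score without touching the model, while serve_real_time(toy_model, "banana") can score an item the batch job never saw.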
When batch fits
- Predictions do not need to reflect the latest event, like a daily product recommendation
- Inputs are known in advance, so you can score the whole population overnight
- You want simple, cheap serving that is just a key lookup
When real time fits
- Inputs are only known at request time, like a fraud check on a new transaction
- Freshness matters, so a stale prediction would be wrong
- The space of possible inputs is too large to precompute
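To make the last point concrete, one rough way to judge whether precomputing is feasible is to multiply the number of distinct values each input feature can take. The feature names, cardinalities, and row budget below are invented for illustration, not taken from any real system.

```python
import math
from typing import Dict


def input_space_size(cardinalities: Dict[str, int]) -> int:
    """Number of distinct input combinations: the product of per-feature cardinalities."""
    return math.prod(cardinalities.values())


def can_precompute(cardinalities: Dict[str, int], max_rows: int) -> bool:
    """True if every combination fits within an assumed storage/compute budget."""
    return input_space_size(cardinalities) <= max_rows


# Hypothetical inputs for a fraud check on a new transaction.
fraud_inputs = {"merchant": 50_000, "amount_bucket": 100, "hour_of_day": 24, "country": 200}

# Hypothetical inputs for a small daily recommendation table.
daily_inputs = {"user_segment": 10, "day_of_week": 7}

MAX_ROWS = 10_000_000  # assumed budget for precomputed rows
```

With these made-up numbers, the fraud inputs span 24 billion combinations and exceed the budget, while the daily table needs only 70 rows.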
The tradeoffs
Batch is cheaper and simpler, but its predictions are only as fresh as the last scheduled run and cover only inputs that can be precomputed. Real time is fresh and flexible but needs a low-latency service, careful scaling, and tight monitoring. Many systems blend both, precomputing what they can and scoring live only when needed.
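One way the blended approach might look is sketched below: check the precomputed store first and call the model only on a miss. The function hybrid_predict, the dictionary store, and the length-based scorer are all hypothetical, and a real system would also need to decide when a stored prediction is too old to use.

```python
from typing import Callable, Dict, Tuple


def hybrid_predict(
    store: Dict[str, float],
    model: Callable[[str], float],
    key: str,
) -> Tuple[float, str]:
    """Return (prediction, source), preferring the precomputed store over a live call."""
    if key in store:
        # Cheap path: the batch job already scored this input.
        return store[key], "batch"
    # Fallback path: the input was not precomputed, so score it on demand.
    return model(key), "live"


def length_model(key: str) -> float:
    """Hypothetical stand-in for a trained model."""
    return float(len(key))


# Pretend a scheduled batch job produced this store earlier.
precomputed = {"known_user": 0.25}
```

Returning the source alongside the prediction makes it easy to monitor how often requests fall through to the live path.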
Key idea
Batch precomputes predictions cheaply for known inputs; real time scores live when freshness or unknown inputs demand it.