Two serving modes
You can compute predictions ahead of time or at request time. The right choice depends on how fresh the predictions must be and how large the input space is.
Batch inference
Compute predictions on a schedule and store them.
- Good when inputs are known and change slowly, such as daily user scores
- Pros: simple serving, predictable cost, no latency pressure
- Cons: stale between runs, wasteful if most predictions go unused
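A minimal sketch of the batch pattern, assuming a small in-memory user table and a toy scoring function. The names `score_user`, `run_batch_job`, and `serve_from_store` are hypothetical, and the dict stands in for whatever key-value store a real system would use.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def score_user(features: Features) -> float:
    """Hypothetical stand-in for a model: a fixed weighted sum of two features."""
    return 0.7 * features["activity"] + 0.3 * features["tenure"]


def run_batch_job(users: Dict[str, Features],
                  model: Callable[[Features], float]) -> Dict[str, float]:
    """Scheduled step: score every known user and return the prediction store."""
    return {user_id: model(features) for user_id, features in users.items()}


def serve_from_store(store: Dict[str, float], user_id: str,
                     default: float = 0.0) -> float:
    """Serving step: a plain lookup, no model call at request time."""
    return store.get(user_id, default)


# Assumed example data; a real job would read this from a warehouse table.
users = {
    "u1": {"activity": 1.0, "tenure": 0.0},
    "u2": {"activity": 0.0, "tenure": 1.0},
}
store = run_batch_job(users, score_user)
known = serve_from_store(store, "u1")
unknown = serve_from_store(store, "u999")  # never scored, falls back to default
```

Serving is only a lookup, which is why latency is not a concern here. The tradeoffs show up too: every user is scored whether or not anyone asks, and a user who appears after the run gets the default until the next run.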
Real-time inference
Compute on demand when the request arrives.
- Good when inputs are fresh or the input space is huge, such as search
- Pros: always current, only computes what is needed
- Cons: tight latency budget, harder infrastructure
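A minimal sketch of on-demand inference with a latency check, assuming a toy term-overlap ranker in place of a real model. The names `rank_query` and `handle_request`, and the 50 ms budget, are illustrative assumptions.

```python
import time
from typing import Dict, List


def rank_query(query: str, documents: List[str]) -> List[str]:
    """Hypothetical stand-in for a model: rank documents by shared query terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    # Stable sort: documents with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]


def handle_request(query: str, documents: List[str],
                   budget_ms: float = 50.0) -> Dict[str, object]:
    """Compute the prediction when the request arrives and time the call."""
    start = time.perf_counter()
    results = rank_query(query, documents)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "results": results,
        "elapsed_ms": elapsed_ms,
        "within_budget": elapsed_ms <= budget_ms,
    }


# Assumed example corpus; the query is not known until request time.
documents = [
    "batch inference jobs",
    "real time search ranking",
    "search latency budget",
]
response = handle_request("search latency", documents)
```

Nothing is computed for queries nobody sends, which suits an input space too large to enumerate. The cost is that the model call sits on the request path, so its running time has to fit inside the budget.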
Hybrid
Many systems precompute heavy embeddings in batch, then do a light real-time pass to combine them with fresh context. This captures most of the freshness benefit at a fraction of the cost of fully real-time inference.
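A minimal sketch of the hybrid pattern, assuming toy two-dimensional embeddings and a dot product as the light online step. The names `precompute_embeddings`, `score_online`, and `rank_items` are hypothetical, and the embedding function is a placeholder for an expensive model.

```python
from typing import Dict, List


def precompute_embeddings(items: Dict[str, str]) -> Dict[str, List[float]]:
    """Batch step: placeholder for a heavy model. Embeds each item's text as
    [character count, word count]."""
    return {
        item_id: [float(len(text)), float(len(text.split()))]
        for item_id, text in items.items()
    }


def score_online(embedding: List[float], context: List[float]) -> float:
    """Real-time step: a cheap dot product with a fresh context vector."""
    return sum(e * c for e, c in zip(embedding, context))


def rank_items(embeddings: Dict[str, List[float]],
               context: List[float]) -> List[str]:
    """Rank item ids by online score, highest first."""
    return sorted(
        embeddings,
        key=lambda item_id: score_online(embeddings[item_id], context),
        reverse=True,
    )


# Assumed example catalog, embedded once on a schedule.
items = {"a": "red shoes", "b": "blue running shoes"}
embeddings = precompute_embeddings(items)

# Two different fresh contexts arriving at request time.
ranking_long = rank_items(embeddings, [0.0, 1.0])    # favors more words
ranking_short = rank_items(embeddings, [-1.0, 5.0])  # penalizes length
```

The expensive work happens once per item in the batch step, while each request pays only for a few multiplications. The two rankings differ even though the stored embeddings are identical, which is where the freshness comes from.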
Key idea
Batch when inputs are known and slow-changing; serve in real time when freshness or a vast input space demands it. Hybrids often win.