The repeated input problem
Many inputs repeat: the same image, the same exact prompt, or the same query arrives again and again. Rerunning the model each time wastes compute. Response caching stores the output for each input so that a repeat is served from the cache instead of a new model call.
How it works
The service builds a key from the input, often a hash of it, and looks it up in a cache. On a hit it returns the stored answer with no model call. On a miss it runs the model and stores the result for next time.
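A minimal sketch of that lookup flow, assuming text inputs, an in-memory dictionary as the cache, and SHA-256 as the hash. The names `ResponseCache`, `make_key`, and the `model` callable are hypothetical and chosen for illustration only.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Illustrative in-memory response cache keyed by a hash of the input."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self._model = model                # stand-in for the expensive model call
        self._store: Dict[str, str] = {}   # key -> stored answer
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(text: str) -> str:
        # Hash the raw input bytes so the key has a fixed size.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> str:
        key = self.make_key(text)
        if key in self._store:
            # Hit: return the stored answer with no model call.
            self.hits += 1
            return self._store[key]
        # Miss: run the model and store the result for next time.
        self.misses += 1
        result = self._model(text)
        self._store[key] = result
        return result
```

Calling `get` twice with the same text invokes the model once; the second call is answered from the dictionary. A production service would more likely use a shared store with a size limit, which this sketch leaves out.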
When it pays off
- Inputs repeat often, giving a high hit rate.
- The model is expensive, so each skipped call saves real cost.
- Outputs are stable for a given input.
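The first two conditions can be combined into a rough estimate, assuming every request pays for a cache lookup and only misses pay for the model. The function name, its parameters, and the cost units are illustrative assumptions, not measurements.

```python
def expected_cost_per_request(hit_rate: float,
                              model_cost: float,
                              lookup_cost: float = 0.0) -> float:
    """Average cost of one request when a cache sits in front of the model.

    hit_rate    -- fraction of requests answered from the cache (0.0 to 1.0)
    model_cost  -- cost of one model call, in any unit
    lookup_cost -- cost of one cache lookup, in the same unit
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Every request pays the lookup; only the misses pay for the model.
    return lookup_cost + (1.0 - hit_rate) * model_cost
```

With a hit rate of zero the cache only adds the lookup cost, which is why caching pays off only when inputs actually repeat and the model call is much more expensive than the lookup.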
The catch
- The model must be deterministic. If it is not (for example, it samples its output), every repeat gets the one cached answer rather than a fresh one, and you must accept that.
- Inputs with tiny differences produce different keys and miss the cache.
- Cached answers go stale when the model is updated and may then be wrong, so entries need a clear way to expire.
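One possible way to soften the last two points, sketched under assumptions: normalize the input before hashing so trivial differences share a key, include a model version in the key so an update misses old entries, and give each entry a time-to-live. The names `normalize`, `make_key`, `ExpiringCache`, and the `model_version` string are hypothetical. Lowercasing is safe only when the model treats case as insignificant.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def normalize(text: str) -> str:
    # Collapse runs of whitespace and lowercase the text.
    # Assumption: the model's answer does not depend on case or spacing.
    return " ".join(text.split()).lower()


def make_key(text: str, model_version: str) -> str:
    # Putting the model version in the key means a model update
    # produces new keys, so old entries are never served for it.
    payload = f"{model_version}\n{normalize(text)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExpiringCache:
    """Illustrative cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        # Record the time of storage alongside the value.
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            # Entry is too old: drop it and report a miss.
            del self._store[key]
            return None
        return value
```

The version in the key handles model updates immediately, while the time-to-live bounds how long any single answer can be served. Either mechanism can be used without the other.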
Key idea
Response caching keys outputs by input so repeats are returned without a model call. It saves the most when inputs repeat and the model is costly, but needs stable outputs and a plan to expire stale entries.