Machine Learning

Caching Model Responses

Skip recomputing answers for inputs you have already seen.


The repeated input problem

Many inputs repeat: the same image, the same exact prompt, or the same query arrives again and again. Running the model on every repeat wastes compute. Response caching stores the output for each input so that a repeat is served from a fast lookup instead of a new model call.

How it works

The service builds a key from the input, often a hash of it, and looks it up in a cache. On a hit it returns the stored answer with no model call. On a miss it runs the model and stores the result for next time.
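
The hit and miss flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: it assumes text inputs, an in-memory dict as the cache, and a SHA-256 hash as the key. The names ResponseCache, make_key, and fake_model are hypothetical, and fake_model merely stands in for an expensive model call.

```python
import hashlib


def make_key(text: str) -> str:
    # Hash the input so that keys have a fixed size regardless of input length.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical in-memory response cache wrapped around a model function."""

    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str):
        key = make_key(text)
        if key in self.store:
            # Hit: return the stored answer with no model call.
            self.hits += 1
            return self.store[key]
        # Miss: run the model and store the result for next time.
        self.misses += 1
        result = self.model_fn(text)
        self.store[key] = result
        return result


calls = []


def fake_model(text: str) -> str:
    # Stand-in for an expensive model; records each real invocation.
    calls.append(text)
    return text.upper()


cache = ResponseCache(fake_model)
cache.get("hello")
cache.get("hello")
print(cache.hits, cache.misses, len(calls))  # prints: 1 1 1
```

The second request for "hello" never reaches fake_model, which is the whole saving. A real service would more likely use a shared store with bounded size, but the control flow stays the same.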

When it pays off

  • Inputs repeat often, giving a high hit rate.
  • The model is expensive, so each skipped call saves real cost.
  • Outputs are stable for a given input.
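
One rough way to see how hit rate and model cost interact is to estimate the average cost per request. The sketch below assumes a simple model in which every request pays for a cache lookup and only misses pay for the model. The function name average_cost and all of the figures are made up for illustration.

```python
def average_cost(hit_rate: float, model_cost: float, lookup_cost: float) -> float:
    # Every request pays for the lookup; only the misses pay for the model.
    return lookup_cost + (1.0 - hit_rate) * model_cost


# Hypothetical figures: a 200 ms model call and a 1 ms cache lookup.
with_cache = average_cost(hit_rate=0.75, model_cost=200.0, lookup_cost=1.0)
no_hits = average_cost(hit_rate=0.0, model_cost=200.0, lookup_cost=1.0)
print(with_cache)  # prints: 51.0
print(no_hits)     # prints: 201.0
```

Under these assumptions a 75% hit rate cuts the average from 200 ms to 51 ms, while a cache that never hits only adds the lookup overhead to every request.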

The catch

  • The model should be deterministic. If it samples its output, every repeat gets the same cached answer instead of a fresh variation, and you must accept that.
  • Inputs that differ only slightly, such as by an extra space, produce different keys and miss the cache.
  • Cached answers go stale when the model is updated, so the cache needs a clear way to expire entries.
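
Two of these catches have common mitigations, sketched below under stated assumptions. Normalizing the input before hashing lets trivially different inputs share a key; this sketch assumes case and extra whitespace do not change the right answer, which is not true for every model. Including a model version in the key and giving each entry a time to live are two possible ways to expire stale answers. The names normalize, make_key, ExpiringCache, and the version strings are hypothetical.

```python
import hashlib
import time


def normalize(text: str) -> str:
    # Assumption: case and repeated whitespace do not affect the answer.
    return " ".join(text.lower().split())


def make_key(text: str, model_version: str) -> str:
    # A new model version yields new keys, so old answers are never served.
    payload = f"{model_version}:{normalize(text)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExpiringCache:
    """Hypothetical cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so that expiry can be tested
        self.store = {}     # key -> (expires_at, value)

    def put(self, key: str, value) -> None:
        self.store[key] = (self.clock() + self.ttl, value)

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None:
            return None  # miss
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self.store[key]
            return None
        return value
```

Because get returns None on a miss, this sketch cannot cache a result that is itself None. A fuller version would use a separate sentinel or return a hit flag alongside the value.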

Key idea

Response caching keys outputs by input so repeats are returned without a model call. It saves the most when inputs repeat and the model is costly, but needs stable outputs and a plan to expire stale entries.

Check yourself


1. When does response caching deliver the most savings?

2. Why might a cache need an expiry policy?