What a cold start is
A cold start is the delay incurred when a fresh serving instance must load model weights before it can answer. The first request waits while the weight files are read from disk or the network and copied into host or GPU memory.
Why it can be long
- Large models span many gigabytes, which take time to read and transfer.
- Copying weights to GPU memory adds extra time.
- Frameworks may compile or warm up kernels on the first call.
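As a rough illustration of how these three costs add up, the sketch below estimates a cold start as read time plus GPU copy time plus first-call warmup. The function name, its parameters, and the example figures are hypothetical and are not measurements from any real system. It also assumes the phases run one after another, which a loader that overlaps reading and copying would beat.

```python
def estimate_cold_start_seconds(model_gb, read_gb_per_s, gpu_copy_gb_per_s, warmup_s=0.0):
    """Back-of-the-envelope cold start estimate (hypothetical helper).

    Assumes the three phases do not overlap:
      1. read the weights from disk or network,
      2. copy them into GPU memory,
      3. pay a fixed first-call warmup or compile cost.
    """
    if read_gb_per_s <= 0 or gpu_copy_gb_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    read_s = model_gb / read_gb_per_s      # time to read the files
    copy_s = model_gb / gpu_copy_gb_per_s  # time to move them onto the GPU
    return read_s + copy_s + warmup_s


# Illustrative numbers only: a 16 GB model, 2 GB/s reads,
# 16 GB/s host-to-GPU copies, and 5 s of kernel warmup.
estimate = estimate_cold_start_seconds(16, 2, 16, warmup_s=5)
print(f"estimated cold start: {estimate:.1f} s")
```

With these made-up inputs the read phase dominates (8 s of the 14 s total), which is why caching weights closer to the server is often the first thing to try.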
Why it hurts in production
Cold starts strike exactly when you scale up under load or recover from a crash. New instances are slowest precisely when you need them most, so requests arriving during a traffic spike can time out while those instances are still warming up.
How to soften it
- Keep a pool of warm instances already loaded and idle.
- Cache weights close to the server to speed reads.
- Send a dummy warmup request so kernels compile before real traffic.
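The warmup step in the list above can be sketched as an instance that loads its weights, runs one dummy request, and only then reports itself ready. This is a minimal sketch, not any particular framework's API: `ModelServer`, `load_fn`, and `warmup_input` are hypothetical names, and the model is assumed to be a plain callable.

```python
class ModelServer:
    """Hypothetical serving instance that hides its cold start behind a ready flag."""

    def __init__(self, load_fn, warmup_input):
        self._load_fn = load_fn            # assumed: returns a callable model
        self._warmup_input = warmup_input  # dummy payload used only for warmup
        self._model = None
        self.ready = False                 # a load balancer would poll this flag

    def start(self):
        # Slow part 1: read the weights and place them in memory.
        self._model = self._load_fn()
        # Slow part 2: a dummy request pays the first-call cost
        # (kernel compilation, lazy initialization) before real traffic.
        self._model(self._warmup_input)
        self.ready = True

    def handle(self, request):
        if not self.ready:
            # Refuse traffic instead of letting a real request absorb the cold start.
            raise RuntimeError("instance is still warming up")
        return self._model(request)
```

In this sketch a warm pool is simply a set of such instances on which `start()` has already been called, with traffic routed only to those whose `ready` flag is true.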
Key idea
A cold start is the slow first request while weights load and kernels warm up. Because it hits during scale-ups and recovery, teams keep warm pools, cache weights, and send warmup requests to hide it.