Two different questions
A health check answers a simple yes or no, but there are two distinct questions hiding inside it.
- Liveness: is the process alive at all, or is it stuck and in need of a restart?
- Readiness: can the process serve requests right now, with its caches warm and its dependencies reachable?
Confusing them causes outages. A failed readiness check should only take an instance out of rotation until it recovers. If you restart on it instead, a brief dependency blip can reboot every instance at once.
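The split between the two signals can be sketched in a few lines of Python. This is only an illustration: the names `HealthState`, `beat`, `decide`, and `stall_limit_s` are hypothetical, and the heartbeat-based liveness test is one assumed way of detecting a stuck process, not a prescribed one.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HealthState:
    """Hypothetical per-process health state (all names are illustrative)."""
    stall_limit_s: float = 10.0  # assumed: no heartbeat for this long means "stuck"
    last_heartbeat: float = field(default_factory=time.monotonic)
    cache_warm: bool = False
    dependencies_ok: bool = False

    def beat(self) -> None:
        # Called by the main work loop to show it is still making progress.
        self.last_heartbeat = time.monotonic()

    def is_live(self, now: Optional[float] = None) -> bool:
        # Liveness: has the process made progress recently?
        now = time.monotonic() if now is None else now
        return (now - self.last_heartbeat) < self.stall_limit_s

    def is_ready(self, now: Optional[float] = None) -> bool:
        # Readiness: alive AND able to serve requests right now.
        return self.is_live(now) and self.cache_warm and self.dependencies_ok


def decide(state: HealthState, now: Optional[float] = None) -> str:
    """Map the two signals to two different reactions."""
    if not state.is_live(now):
        return "restart"               # only a liveness failure triggers a restart
    if not state.is_ready(now):
        return "remove_from_rotation"  # a readiness failure only stops traffic
    return "serve"
```

With this shape, a dependency blip flips `dependencies_ok` and the instance stops receiving traffic, but it is never restarted for that reason alone.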
What a good check tests
- A shallow check confirms that the process responds and that its event loop is not blocked.
- A deep check verifies critical dependencies like the database are reachable.
Deep checks are powerful but dangerous: if every instance's deep check probes the same database, one slow database can mark the whole fleet unhealthy and remove all capacity at once.
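The fleet-wide failure mode is easier to see in a small simulation. The sketch below is hypothetical throughout: `shallow_check`, `deep_check`, `healthy_instances`, and the `probes` mapping are made-up names, and a probe that raises or returns a falsy value stands in for a slow or unreachable dependency.

```python
from typing import Callable, Dict


def shallow_check() -> bool:
    # Reaching this line at all means the process answered the check.
    return True


def deep_check(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each dependency probe; an exception counts as a failed probe."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def healthy_instances(fleet_size: int,
                      probes: Dict[str, Callable[[], bool]],
                      deep: bool) -> int:
    """Count instances that pass, given that all share the same dependencies."""
    healthy = 0
    for _ in range(fleet_size):
        ok = shallow_check()
        if deep:
            ok = ok and all(deep_check(probes).values())
        if ok:
            healthy += 1
    return healthy


def slow_database() -> bool:
    # Stand-in for a shared database that times out for every caller.
    raise TimeoutError("database did not answer in time")


probes = {"database": slow_database}
print(healthy_instances(5, probes, deep=False))  # shallow only: fleet stays up
print(healthy_instances(5, probes, deep=True))   # deep: every instance fails together
```

Because every instance runs the same probe against the same database, the deep check fails everywhere at the same moment, which is exactly the correlated loss of capacity described above.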
Startup behavior
New instances need time to warm up. A startup grace period lets an instance finish booting before liveness probes begin, so a slow start is not mistaken for a crash.
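As a rough sketch of how a grace period changes the supervisor's decision, consider the function below. The name `liveness_action`, the `grace_s` parameter, and the 30-second default are assumptions chosen for illustration, not values from any particular orchestrator.

```python
def liveness_action(uptime_s: float, probe_ok: bool, grace_s: float = 30.0) -> str:
    """Decide what to do with a liveness probe result.

    During the grace period the probe result is ignored, so an instance
    that is still booting is never restarted for being slow.
    """
    if uptime_s < grace_s:
        return "wait"  # still inside the startup grace period
    return "ok" if probe_ok else "restart"


print(liveness_action(uptime_s=5.0, probe_ok=False))   # wait
print(liveness_action(uptime_s=45.0, probe_ok=False))  # restart
print(liveness_action(uptime_s=45.0, probe_ok=True))   # ok
```

The grace period has to cover the slowest normal boot. Set it too short and healthy instances get restarted in a loop; set it too long and a genuinely crashed instance lingers before it is replaced.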
Key idea
Separate liveness from readiness so restarts and traffic routing react to the right signal.