System Design

Durable Execution and Checkpoints

Resume long-running code after a crash by replaying from a journal.

6 min read · advanced

Code That Survives Crashes

A workflow may run for hours and call many services. If the process dies halfway, you do not want to repeat external calls that already completed, such as charging a card a second time. Durable execution lets the code resume from where it stopped, as if the crash had not happened.

The Journal of Results

The engine records the result of each completed step in a durable journal. On restart it replays the code. When replay reaches a step already in the journal, it returns the saved result instead of executing the step again. Once replay passes the last journal entry, normal execution continues.
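The replay rule above can be sketched in a few lines. This is a minimal illustration, not any particular engine's API: the `Journal` and `Workflow` classes, the JSON-lines file format, and the `charge` and `ship` steps are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class Journal:
    """Append-only list of step results, stored as JSON lines (assumed format)."""

    def __init__(self, path):
        self.path = path
        self.entries = []
        if os.path.exists(path):
            with open(path) as f:
                self.entries = [json.loads(line) for line in f if line.strip()]

    def append(self, result):
        with open(self.path, "a") as f:
            f.write(json.dumps(result) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the entry durable before moving on
        self.entries.append(result)


class Workflow:
    def __init__(self, journal):
        self.journal = journal
        self.cursor = 0  # position of the next step within this run

    def step(self, fn):
        if self.cursor < len(self.journal.entries):
            result = self.journal.entries[self.cursor]  # replay: reuse saved result
        else:
            result = fn()  # past the last entry: execute for real
            self.journal.append(result)
        self.cursor += 1
        return result


calls = []  # records which steps really executed


def charge():
    calls.append("charge")
    return {"charge_id": 1}


def ship():
    calls.append("ship")
    return "shipped"


def run(path, crash_after_charge=False):
    wf = Workflow(Journal(path))
    receipt = wf.step(charge)
    if crash_after_charge:
        raise RuntimeError("simulated crash")
    return receipt, wf.step(ship)


path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
try:
    run(path, crash_after_charge=True)  # first attempt dies after step 1
except RuntimeError:
    pass
result = run(path)  # restart: step 1 is replayed, step 2 executes
```

After the restart, `calls` contains `charge` only once: the second run read the receipt from the journal instead of calling `charge` again. Steps are matched by position here, which is why the code must take the same path on every run.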

Determinism Is Required

Replay only works if the code follows the same path each time. So workflow code must be deterministic:

  • No direct clock or random reads in the flow. Route them through the engine so the value is journaled.
  • No nondeterministic branching on unrecorded external state.

Side effects belong in journaled steps, not loose in the workflow body.
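As a rough sketch of routing clock and random reads through the engine, the hypothetical `Context` below journals each value the first time it is read and returns the saved value on replay. The class, its `now` and `random` methods, and the in-memory journal are assumptions made for illustration; a real engine would persist the journal.

```python
import random
import time


class Context:
    """Hypothetical workflow context that journals nondeterministic reads."""

    def __init__(self, journal=None):
        self.journal = list(journal or [])  # in-memory here for brevity
        self.cursor = 0

    def _record(self, kind, produce):
        if self.cursor < len(self.journal):
            saved_kind, value = self.journal[self.cursor]
            if saved_kind != kind:
                # The code took a different path than the recorded run.
                raise RuntimeError(
                    f"nondeterminism: journal has {saved_kind!r}, code asked for {kind!r}"
                )
        else:
            value = produce()
            self.journal.append((kind, value))
        self.cursor += 1
        return value

    def now(self):
        return self._record("now", time.time)

    def random(self):
        return self._record("random", random.random)


def workflow(ctx):
    started = ctx.now()
    # Branching on a journaled value is safe: replay sees the same number.
    branch = "a" if ctx.random() < 0.5 else "b"
    return started, branch


first = Context()
original = workflow(first)
replayed = workflow(Context(first.journal))  # a restart replays the journal
```

Here `replayed` equals `original` even though wall-clock time has moved on. Had `workflow` called `time.time()` or `random.random()` directly, the second run could take the other branch and diverge from the journal.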

Checkpoints Versus Replay

  • Replay from the journal rebuilds state by re-running code against recorded results. Little extra storage, but compute heavy: restart cost grows with the length of the journal.
  • A snapshot checkpoint saves the full state periodically and restores it directly. Fast restart, larger storage.

Many engines combine both: snapshot occasionally and replay the tail.
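The combined approach can be sketched as follows. To keep the example small, the journal here is a list of numeric events rather than step results, and `apply`, `run`, `restore`, and the `SNAPSHOT_EVERY` interval are hypothetical names and values, not taken from any specific engine.

```python
SNAPSHOT_EVERY = 3  # assumed snapshot interval, in applied events


def apply(state, event):
    """Deterministic transition: add the event amount to the balance."""
    return {"balance": state["balance"] + event, "applied": state["applied"] + 1}


def run(events):
    """Process all events, keeping the most recent periodic snapshot."""
    state = {"balance": 0, "applied": 0}
    snapshot = dict(state)
    for event in events:
        state = apply(state, event)
        if state["applied"] % SNAPSHOT_EVERY == 0:
            snapshot = dict(state)  # copy of the full state
    return state, snapshot


def restore(snapshot, events):
    """Start from the snapshot and replay only the events recorded after it."""
    state = dict(snapshot)
    tail = events[state["applied"]:]
    for event in tail:
        state = apply(state, event)
    return state, len(tail)


events = [10, -3, 5, 7, -2]
final, snapshot = run(events)
restored, replayed_count = restore(snapshot, events)
full, full_count = restore({"balance": 0, "applied": 0}, events)
```

Both restores reach the same state as `final`, but the snapshot path replays 2 events while the replay-only path re-applies all 5. A shorter snapshot interval shrinks the tail at the cost of writing full state more often.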

Key idea

Durable execution journals each step result and replays deterministic code on restart, so steps whose results are already in the journal are not executed again.

Check yourself

1. How does durable execution avoid repeating completed steps after a crash?

2. Why must durable workflow code be deterministic?

3. Where should a clock or random read happen in a durable workflow?