Durable Execution and Checkpoints

Code That Survives Crashes

A workflow may run for hours and call many services. If the process dies halfway, you do not want to repeat completed external calls. Durable execution lets the code resume from where it stopped as if nothing happened.

The Journal of Results

The engine records the result of each completed step in a durable journal. On restart it re-runs the workflow code from the beginning. When replay reaches a step that is already in the journal, it returns the saved result instead of executing the step again. Once replay passes the last journal entry, normal execution continues.
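The replay rule above can be sketched in a few lines. This is a minimal illustration, not any particular engine's API: the names Journal, Context, step, and the JSON-lines file layout are all assumptions made for the example.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical append-only log of step results, one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.entries = []
        if os.path.exists(path):
            with open(path) as f:
                self.entries = [json.loads(line) for line in f if line.strip()]

    def append(self, name, result):
        entry = {"step": name, "result": result}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the entry durable before continuing
        self.entries.append(entry)


class Context:
    """Returns journaled results during replay, then executes steps live."""

    def __init__(self, journal):
        self.journal = journal
        self.cursor = 0      # index of the next journal entry to replay
        self.executed = []   # names of steps that really ran in this process

    def step(self, name, fn):
        if self.cursor < len(self.journal.entries):
            # Replay: the step already completed, so return the saved result.
            entry = self.journal.entries[self.cursor]
            if entry["step"] != name:
                raise RuntimeError(f"replay diverged at {name!r}")
            self.cursor += 1
            return entry["result"]
        # Past the last journal entry: execute for real and record the result.
        result = fn()
        self.journal.append(name, result)
        self.cursor += 1
        self.executed.append(name)
        return result


class SimulatedCrash(Exception):
    pass


def workflow(ctx, crash_after_reserve=False):
    reservation = ctx.step("reserve", lambda: {"id": 7})
    if crash_after_reserve:
        raise SimulatedCrash()
    return ctx.step("charge", lambda: {"paid_for": reservation["id"]})


def run_demo():
    path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
    first = Context(Journal(path))
    try:
        workflow(first, crash_after_reserve=True)
    except SimulatedCrash:
        pass
    second = Context(Journal(path))  # "restart": reload the journal from disk
    result = workflow(second)
    return first.executed, second.executed, result
```

In run_demo the first run executes only "reserve" before the simulated crash. The second run replays "reserve" from the journal and executes only "charge", so no completed step runs twice.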

Determinism Is Required

Replay only works if the code follows the same path each time. So workflow code must be deterministic:

No direct clock or random reads in the flow. Route them through the engine so the value is journaled.
No branching on external state that was not recorded in the journal.

Side effects belong in journaled steps, not loose in the workflow body.
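As a rough sketch of routing clock and random reads through the engine, the example below journals each value on first read and returns the journaled value on replay. ReplayContext, now, and random are hypothetical names, and the journal is an in-memory list to keep the example short.

```python
import random
import time


class ReplayContext:
    """Hypothetical context that journals nondeterministic reads."""

    def __init__(self, journal=None):
        self.journal = list(journal or [])
        self.cursor = 0  # index of the next journaled value to hand back

    def _record(self, produce):
        if self.cursor < len(self.journal):
            value = self.journal[self.cursor]   # replay: reuse the saved value
        else:
            value = produce()                   # live: read once and journal it
            self.journal.append(value)
        self.cursor += 1
        return value

    def now(self):
        return self._record(time.time)

    def random(self):
        return self._record(random.random)


def workflow(ctx):
    started = ctx.now()
    # The branch depends only on a journaled value, so replay takes the
    # same path as the original run.
    branch = "A" if ctx.random() < 0.5 else "B"
    return started, branch
```

Running workflow once with a fresh ReplayContext and again with a context built from the first journal yields the same timestamp and the same branch. Calling time.time() or random.random() directly in the workflow body would not.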

Checkpoints Versus Replay

Replay from journal: rebuilds state by re-running code against recorded results. Needs little extra storage, but restart time grows with the length of the journal.
Snapshot checkpoint: saves the full state periodically and restores it directly. Restart is fast, but each snapshot takes more storage.

Many engines combine both: snapshot occasionally and replay the tail.
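One way to picture the combined approach, under simplifying assumptions: state is a plain dict, each journal entry is a (key, amount) pair, and applying an entry stands in for re-running workflow code against a recorded result. Store, record, recover, and snapshot_every are names invented for this sketch.

```python
import copy


def apply(state, entry):
    """Fold one journal entry into the state (stand-in for replaying a step)."""
    key, amount = entry
    state[key] = state.get(key, 0) + amount
    return state


class Store:
    def __init__(self, snapshot_every=3):
        self.journal = []
        # Snapshot is (copy of state, number of journal entries it covers).
        self.snapshot = ({}, 0)
        self.snapshot_every = snapshot_every

    def record(self, state, entry):
        self.journal.append(entry)
        apply(state, entry)
        # Take a full snapshot only every few entries.
        if len(self.journal) % self.snapshot_every == 0:
            self.snapshot = (copy.deepcopy(state), len(self.journal))

    def recover(self):
        saved, covered = self.snapshot
        state = copy.deepcopy(saved)        # restore the snapshot directly
        tail = self.journal[covered:]       # entries written after the snapshot
        for entry in tail:
            apply(state, entry)             # replay only the tail
        return state, len(tail)
```

With snapshot_every=3 and five recorded entries, recover restores the snapshot taken after the third entry and replays only the last two. A real engine would also persist the snapshot and journal durably, which this sketch leaves out.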

Key idea