The reproducibility problem
A concurrency bug may appear once in a thousand runs. Without a way to reproduce it, you cannot study it under a debugger.
What replay records
Deterministic replay records the few sources of nondeterminism so a later run follows the same path:
- the order threads acquired locks
- the results of nondeterministic reads such as inputs and timers
- the interleaving of accesses to shared memory
During replay, the log dictates each of these choices in place of the scheduler and the environment, so the failing run repeats exactly.
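The record-then-force cycle can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular replay tool works: threads are simulated with Python generators, the only nondeterminism is which simulated thread runs next, and the names worker and run are hypothetical.

```python
import random

def worker(name, shared, steps):
    """Simulated thread (hypothetical): a non-atomic read-modify-write per step."""
    for _ in range(steps):
        value = shared["counter"]        # read the shared value
        yield name                       # scheduling point between read and write
        shared["counter"] = value + 1    # write back, possibly losing another update

def run(schedule=None, seed=None):
    """Run two simulated threads.

    With schedule=None the scheduler picks at random and records each choice.
    With a schedule, every choice is forced from the recorded log instead.
    """
    rng = random.Random(seed)
    shared = {"counter": 0}
    threads = {"A": worker("A", shared, 3), "B": worker("B", shared, 3)}
    log = []
    replay = iter(schedule) if schedule is not None else None
    while threads:
        if replay is not None:
            name = next(replay)               # replay: take the logged choice
        else:
            name = rng.choice(sorted(threads))  # record: nondeterministic choice
        log.append(name)
        try:
            next(threads[name])               # run the chosen thread to its next yield
        except StopIteration:
            del threads[name]                 # thread finished
    return shared["counter"], log

# Record one run, then replay it from its log.
result, log = run()
replayed, _ = run(schedule=log)
assert replayed == result
```

In this model a fully serial schedule leaves the counter at 6, while a strictly alternating one leaves it at 3 because every update overwrites the other thread's. Whichever outcome the recorded run produced, replaying its log produces the same one.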
The cost tradeoff
Recording everything is expensive. Practical systems log only the nondeterministic events (scheduling decisions and external inputs) and re-execute the deterministic computation during replay, which keeps the log small while still reproducing the bug.
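To illustrate why the log can stay small, here is a minimal sketch that logs only the result of a nondeterministic read and recomputes everything else. The NondetSource class, the checksum function, and the program function are hypothetical names invented for this example; real systems intercept such reads at the system-call or runtime level.

```python
import time

class NondetSource:
    """Hypothetical wrapper: records nondeterministic results, or replays them."""

    def __init__(self, log=None):
        self.replaying = log is not None
        self.log = list(log) if log is not None else []
        self._pos = 0

    def read(self, fn):
        if self.replaying:
            value = self.log[self._pos]   # replay: return the recorded result
            self._pos += 1
            return value
        value = fn()                      # record: perform the real read
        self.log.append(value)
        return value

def checksum(seed, rounds=10_000):
    """Deterministic work: never logged, simply re-executed during replay."""
    x = seed
    for _ in range(rounds):
        x = (x * 1103515245 + 12345) % 2**31
    return x

def program(source):
    stamp = source.read(time.time_ns)     # nondeterministic timer read: logged
    return checksum(stamp % 1000)         # deterministic computation: recomputed

recorder = NondetSource()
original = program(recorder)              # recording run
replayed = program(NondetSource(recorder.log))  # replay run uses the log
assert replayed == original
assert len(recorder.log) == 1             # one entry covers 10,000 rounds of work
```

The ten thousand loop iterations cost nothing in log space, because given the same logged timer value they always produce the same result.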
Key idea
Replay turns a rare nondeterministic failure into a repeatable one by logging the scheduling and input choices, then forcing those exact choices on every replay.