What reproducibility means
A run is reproducible when running it again yields the same model and metrics within tolerance. This is the foundation for trusting comparisons and debugging regressions.
The four things to pin
- Code by committing and recording the exact git hash.
- Data by referencing an immutable dataset version, not a mutable folder.
- Config by capturing every hyperparameter in a versioned file.
- Randomness by setting and logging seeds for all libraries.
Sources of nondeterminism
Even with seeds, results can drift from nondeterministic GPU kernels, parallel data loading order, and floating point reduction order. You manage these by enabling deterministic modes where available and accepting a small tolerance otherwise.
The practical test is simple. If a teammate cannot recreate your number from the recorded inputs, the run is not reproducible.
Why it pays off
Reproducibility makes A versus B comparisons honest, lets you bisect a regression to a single change, and turns a one off success into a repeatable process.
Key idea
Reproducible runs pin code, data, config, and seeds so the same inputs recreate the same model, making comparisons trustworthy and regressions debuggable.