Reproducible Training Runs

What reproducibility means

A run is reproducible when running it again yields the same model and metrics within tolerance. This is the foundation for trusting comparisons and debugging regressions.

The four things to pin

Code by committing and recording the exact git hash.
Data by referencing an immutable dataset version, not a mutable folder.
Config by capturing every hyperparameter in a versioned file.
Randomness by setting and logging seeds for all libraries.

Sources of nondeterminism

Even with seeds, results can drift from nondeterministic GPU kernels, parallel data loading order, and floating point reduction order. You manage these by enabling deterministic modes where available and accepting a small tolerance otherwise.

The practical test is simple. If a teammate cannot recreate your number from the recorded inputs, the run is not reproducible.

Why it pays off

Reproducibility makes A versus B comparisons honest, lets you bisect a regression to a single change, and turns a one off success into a repeatable process.

Key idea

Reproducible runs pin code, data, config, and seeds so the same inputs recreate the same model, making comparisons trustworthy and regressions debuggable.

Reproducible Training Runs

What reproducibility means

The four things to pin

Sources of nondeterminism

Why it pays off

Key idea

Check yourself