Surviving long runs
Large training jobs run for days, and hardware fails or gets preempted. Checkpointing periodically saves the full training state so the job can resume exactly where it stopped rather than starting over.
- Save the weights and the optimizer state.
- Save the step count, schedule position, and data position.
- On restart, load the checkpoint and continue.
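As a minimal sketch of the list above, the state can be gathered into one object and written out as a unit. The field names and the helpers `make_state`, `save_checkpoint`, and `load_checkpoint` are hypothetical, and `pickle` stands in for whatever serializer a real training framework provides.

```python
import pickle


def make_state(weights, momentum, step, schedule_step, data_position):
    """Bundle everything the next training step depends on (illustrative fields)."""
    return {
        "weights": weights,              # model parameters
        "optimizer": {"momentum": momentum},  # optimizer state, e.g. momentum buffers
        "step": step,                    # global step count
        "schedule_step": schedule_step,  # position in the learning rate schedule
        "data_position": data_position,  # index of the next example to read
    }


def save_checkpoint(state, path):
    """Serialize the whole state to a single file."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path):
    """Read the state back; the caller continues from state['step']."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Keeping everything in one object makes it harder to save the weights while forgetting the optimizer or the data position.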
Doing it right
A checkpoint must capture everything needed to reproduce the next step, not just the weights. Forgetting the optimizer state or the learning rate schedule position typically shows up as a loss spike on resume. Writing atomically avoids leaving a corrupt file if a crash hits mid-save.
- Include the random number generator state for reproducibility.
- Write to a temporary file in the same directory, then rename it over the final path so the swap is atomic.
- Keep a few recent checkpoints in case one is corrupt.
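The write-then-rename and keep-a-few-recent points can be sketched as follows. This assumes a POSIX-style filesystem where a rename within one directory is atomic. The function names `atomic_save` and `load_latest`, the `ckpt_<step>.pkl` naming scheme, and the `keep` parameter are all illustrative choices, not a fixed convention.

```python
import os
import pickle
import tempfile
from pathlib import Path


def atomic_save(state, directory, step, keep=3):
    """Write a checkpoint atomically, then prune all but the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers make lexicographic order match numeric order.
    final_path = directory / f"ckpt_{step:08d}.pkl"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(directory), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, final_path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    for old in sorted(directory.glob("ckpt_*.pkl"))[:-keep]:
        old.unlink()
    return final_path


def load_latest(directory):
    """Return the newest readable checkpoint, falling back to older ones."""
    for path in sorted(Path(directory).glob("ckpt_*.pkl"), reverse=True):
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except Exception:
            continue  # unreadable or corrupt: try the next older file
    return None
```

A crash before `os.replace` leaves only a stray `.tmp` file, so the previous checkpoint stays intact, and `load_latest` covers the case where the newest file turns out to be unreadable anyway.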
Save and restore
A complete checkpoint turns a multi-day run into a sequence of safely resumable segments.
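To make "resumable segments" concrete, here is a toy sketch in which a run stopped at step 40 and resumed from its checkpoint ends in the same state as an uninterrupted run. The `train` function and its one-weight "model" are invented for illustration. The point is that the random number generator state travels with the checkpoint, so the resumed segment draws the same random numbers the uninterrupted run would have.

```python
import pickle
import random


def train(total_steps, state=None, stop_at=None):
    """Toy loop: each step nudges a single weight by a random amount."""
    if state is None:
        # Fresh run: fixed seed, nothing trained yet.
        state = {"step": 0, "weight": 0.0, "rng_state": random.Random(0).getstate()}
    rng = random.Random()
    rng.setstate(state["rng_state"])  # continue the random stream where it left off
    step, weight = state["step"], state["weight"]
    end = total_steps if stop_at is None else min(stop_at, total_steps)
    while step < end:
        weight += rng.uniform(-1.0, 1.0)
        step += 1
    return {"step": step, "weight": weight, "rng_state": rng.getstate()}


full = train(100)                                # uninterrupted run
partial = train(100, stop_at=40)                 # "crash" after step 40
restored = pickle.loads(pickle.dumps(partial))   # round-trip through a checkpoint
resumed = train(100, state=restored)             # second segment
print(resumed == full)
```

If `rng_state` were left out of the checkpoint and the generator reseeded on restart, the two runs would diverge from step 41 onward even though the weights and step count were restored correctly.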
Key idea
Checkpointing saves the full training state, including the optimizer state and schedule position, so a long run can resume exactly where it stopped after a crash or preemption.