Machine Learning

Checkpointing and Resuming Training

Save full training state so a long run survives crashes and preemptions.

Surviving long runs

Large training jobs can run for days, and over that time hardware fails or the job gets preempted. Checkpointing periodically saves the full training state so the job can resume exactly where it stopped rather than starting over.

  • Save the weights and the optimizer state.
  • Save the step count, schedule position, and data position.
  • On restart, load the checkpoint and continue.
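The three points above can be sketched with a toy training loop. This is a minimal illustration using only the Python standard library, not a framework recipe: the dataset, the `learning_rate` schedule, the momentum update, and every function name here are assumptions made for the example.

```python
import os
import pickle

DATA = [1.0, 2.0, 3.0, 4.0]  # hypothetical toy dataset


def new_state():
    # Everything the next step depends on lives in one dictionary.
    return {
        "weights": [0.0],
        "optimizer": {"velocity": [0.0]},  # momentum buffer
        "step": 0,                         # also fixes the schedule position
        "data_position": 0,                # index of the next example to read
    }


def learning_rate(step):
    # Linear warmup over 5 steps; a pure function of the step count.
    return 0.05 * min(1.0, (step + 1) / 5)


def train_step(state):
    x = DATA[state["data_position"]]
    w = state["weights"][0]
    grad = 2.0 * (w - x)  # gradient of (w - x)^2
    v = 0.9 * state["optimizer"]["velocity"][0] + grad
    state["optimizer"]["velocity"][0] = v
    state["weights"][0] = w - learning_rate(state["step"]) * v
    state["step"] += 1
    state["data_position"] = (state["data_position"] + 1) % len(DATA)


def save_checkpoint(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_or_init(path):
    # On restart, load the checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return new_state()


def train(path, total_steps, save_every=5):
    state = load_or_init(path)
    while state["step"] < total_steps:
        train_step(state)
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    return state
```

Calling `train(path, total_steps=10)` after an earlier call was cut off at step 5 picks up from the saved step, velocity, and data position instead of from zero. The plain `open(..., "wb")` write here is not crash-safe; it is kept simple to show what goes into the checkpoint.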

Doing it right

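A checkpoint must capture everything needed to reproduce the next step, not just the weights. If the optimizer state or the learning rate schedule position is missing, the run restarts with reset moment estimates or the wrong learning rate, which typically shows up as a visible loss spike on resume. Writing the file atomically ensures that a crash mid-save cannot leave a corrupt checkpoint in place of a good one.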

  • Include the random number generator state for reproducibility.
  • Write to a temporary file on the same filesystem, then rename it over the target so the update is atomic.
  • Keep a few recent checkpoints in case one is corrupt.
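The write-then-rename and keep-a-few-recent ideas above can be sketched as follows. This is one possible approach using only the standard library; the `ckpt_XXXXXXXX.pkl` naming scheme and the helper names are assumptions for illustration, and the state is assumed to be a dictionary with a `"step"` key.

```python
import os
import pickle
import tempfile


def atomic_save(obj, path):
    """Write obj so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory (same filesystem);
    # a rename across filesystems would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())    # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def list_checkpoints(directory):
    names = [n for n in os.listdir(directory)
             if n.startswith("ckpt_") and n.endswith(".pkl")]
    return sorted(names)  # zero-padded step numbers sort oldest to newest


def save_and_rotate(state, directory, keep=3):
    # Assumes keep >= 1.
    name = f"ckpt_{state['step']:08d}.pkl"
    atomic_save(state, os.path.join(directory, name))
    # Delete everything except the `keep` most recent checkpoints.
    for old in list_checkpoints(directory)[:-keep]:
        os.remove(os.path.join(directory, old))


def load_latest_valid(directory):
    # Try newest first and fall back to older files if one fails to load.
    for name in reversed(list_checkpoints(directory)):
        try:
            with open(os.path.join(directory, name), "rb") as f:
                return pickle.load(f)
        except Exception:
            # A damaged pickle can raise several error types; skip the file.
            continue
    return None
```

With `keep=3`, a run that saves at steps 1 through 5 ends up with the files for steps 3, 4, and 5. If the newest file is damaged, `load_latest_valid` returns the step 4 state, so the run loses a few steps instead of restarting from scratch.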

Save and restore

A complete checkpoint turns a multi-day run into a sequence of safely resumable segments.
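To make "safely resumable" concrete, here is a small sketch comparing an uninterrupted run with one that is saved, restored, and continued. The noisy quadratic objective, the decay schedule in `lr_at`, and all names are assumptions chosen for the example; the random noise stands in for things like data shuffling.

```python
import pickle
import random


def init(seed=0):
    state = {"w": 5.0, "velocity": 0.0, "step": 0}
    return state, random.Random(seed)


def lr_at(step):
    return 0.1 * (0.95 ** step)  # exponential decay, a function of the step


def train_step(state, rng):
    noise = rng.uniform(-0.1, 0.1)   # consumes random numbers each step
    grad = 2.0 * state["w"] + noise  # gradient of w^2 plus noise
    state["velocity"] = 0.9 * state["velocity"] + grad
    state["w"] -= lr_at(state["step"]) * state["velocity"]
    state["step"] += 1


def run(steps, state, rng):
    for _ in range(steps):
        train_step(state, rng)


def to_bytes(state, rng):
    # Full checkpoint: weights, optimizer state, step, and RNG state.
    return pickle.dumps({"state": state, "rng_state": rng.getstate()})


def from_bytes(blob):
    payload = pickle.loads(blob)
    rng = random.Random()
    rng.setstate(payload["rng_state"])
    return payload["state"], rng


# A: ten steps without interruption.
state_a, rng_a = init()
run(10, state_a, rng_a)

# B: five steps, checkpoint, restore into fresh objects, five more steps.
state_b, rng_b = init()
run(5, state_b, rng_b)
blob = to_bytes(state_b, rng_b)
state_c, rng_c = from_bytes(blob)
run(5, state_c, rng_c)

# C: restore only the weight, resetting velocity, step, and RNG.
state_d = {"w": pickle.loads(blob)["state"]["w"], "velocity": 0.0, "step": 0}
run(5, state_d, random.Random(0))

print("full restore matches:", state_c == state_a)
print("weights-only restore matches:", state_d["w"] == state_a["w"])
```

The full restore reproduces the uninterrupted run because every input to the next step was saved. The weights-only restore continues from the right weight but with a reset momentum buffer, the wrong learning rate, and a different noise sequence, so it follows a different trajectory.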

Key idea

Checkpointing saves the full training state, including the optimizer state and schedule position, so a long run can resume exactly where it stopped after a crash or preemption.

Check yourself

1. Why save the optimizer state in a checkpoint?

2. Why write a checkpoint atomically?