Machine Learning

Model Checkpointing

Saving training state so you can resume, recover, and keep the best model.

More than saving weights

A checkpoint captures the state needed to continue or reuse training. The weights alone are enough to run inference, but resuming a paused run needs more: the optimizer state, the current epoch and step, the learning rate schedule position, and the random number generator state.

Two purposes

  • Best checkpoint: saved whenever a validation metric improves, and used for final deployment.
  • Latest checkpoint: saved periodically so a crashed long run can resume without losing days of compute.
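
The two purposes can be sketched as one loop that writes two files. This is a minimal illustration, not a framework recipe: the `run` function, the file names `best.json` and `latest.json`, and the list of validation losses are all hypothetical stand-ins for a real training loop.

```python
import json
import os
import tempfile


def save_atomic(state, path):
    """Write to a temp file, then rename, so a crash mid-write cannot corrupt `path`."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem


def run(val_losses, out_dir, save_every=2):
    """Hypothetical loop: `val_losses` stands in for real per-epoch validation."""
    best_loss = float("inf")
    for epoch, val_loss in enumerate(val_losses, start=1):
        state = {"epoch": epoch, "val_loss": val_loss}
        if val_loss < best_loss:  # metric improved: refresh the best checkpoint
            best_loss = val_loss
            save_atomic(state, os.path.join(out_dir, "best.json"))
        if epoch % save_every == 0:  # periodic: refresh the latest checkpoint
            save_atomic(state, os.path.join(out_dir, "latest.json"))


def load(path):
    with open(path) as f:
        return json.load(f)


out_dir = tempfile.mkdtemp()
# Validation loss improves until epoch 3, then gets worse.
run([0.9, 0.6, 0.5, 0.7, 0.8, 0.85], out_dir)
best = load(os.path.join(out_dir, "best.json"))
latest = load(os.path.join(out_dir, "latest.json"))
print(best["epoch"], latest["epoch"])
```

Here the best file ends up holding epoch 3 while the latest file holds epoch 6, so resuming and deploying read from different files.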

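What goes in

  • Model weights.
  • Optimizer state, such as momentum buffers.
  • Current epoch and step.
  • Learning rate schedule position.
  • Random number generator state.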
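
As a rough sketch of a resumable checkpoint, the snippet below bundles those pieces into one dictionary and round-trips it through a file. It uses only the Python standard library; the `make_checkpoint` helper, the key names, and the toy values are assumptions for illustration, and a real framework would supply its own state objects.

```python
import os
import pickle
import random
import tempfile


def make_checkpoint(weights, velocity, epoch, step, lr_step):
    """Hypothetical helper: gather everything a resumed run needs."""
    return {
        "weights": list(weights),                   # enough for inference
        "optimizer": {"velocity": list(velocity)},  # momentum buffers
        "epoch": epoch,
        "step": step,
        "scheduler": {"lr_step": lr_step},          # schedule position
        "rng_state": random.getstate(),             # Python RNG state
    }


random.seed(0)
random.random()  # pretend earlier training already consumed random numbers

ckpt = make_checkpoint([0.1, -0.2], [0.01, 0.03], epoch=3, step=1200, lr_step=1200)
path = os.path.join(tempfile.mkdtemp(), "latest.pkl")
with open(path, "wb") as f:
    pickle.dump(ckpt, f)

# What the uninterrupted run would draw next (e.g. for shuffling or dropout).
expected_next = random.random()

# Resume: load the file and restore the RNG before continuing.
with open(path, "rb") as f:
    restored = pickle.load(f)
random.setstate(restored["rng_state"])
resumed_next = random.random()

print(resumed_next == expected_next)
```

Because the generator state is restored, the resumed run draws the same next random number as the uninterrupted one. Restoring only the original seed would restart the sequence from the beginning instead.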

Practical discipline

  • Keep the best by metric separately from the most recent, since the latest may be overfitting.
  • Save the optimizer state; otherwise the momentum buffers reset and the loss can spike on resume.
  • Version checkpoints and record the config so a saved model is reproducible.
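
The optimizer-state point can be checked on a toy problem. The sketch below runs SGD with momentum on the one-dimensional loss w², pauses halfway, and resumes twice: once with the saved velocity and once with it reset to zero. The loss, the hyperparameters, and the function names are assumptions chosen to keep the example small.

```python
def sgd_momentum_step(w, v, lr=0.1, beta=0.9):
    """One SGD-with-momentum update on the toy loss w**2."""
    grad = 2.0 * w
    v = beta * v + grad  # velocity (momentum buffer) accumulates gradients
    w = w - lr * v
    return w, v


def train(w, v, steps):
    for _ in range(steps):
        w, v = sgd_momentum_step(w, v)
    return w, v


# Reference: 10 uninterrupted steps.
w_full, _ = train(1.0, 0.0, 10)

# Interrupted run: 5 steps, checkpoint, then 5 more.
w_mid, v_mid = train(1.0, 0.0, 5)
w_with, _ = train(w_mid, v_mid, 5)    # velocity restored from the checkpoint
w_without, _ = train(w_mid, 0.0, 5)   # weights only: velocity reset to zero

print(w_with == w_full)
print(abs(w_without - w_full))
```

With the velocity restored, the resumed run reproduces the uninterrupted weights exactly. With it reset, the weights land somewhere else, which in a real run shows up as a jump in the loss curve right after the resume.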

Practical notes

  • Checkpoint frequency trades disk and time against how much work a crash can cost.
  • For huge models, consider saving only every few epochs to limit storage.
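
The frequency trade-off can be put into rough numbers. The sketch below assumes a blocking save that pauses training, and the `checkpoint_tradeoff` function with its example durations is hypothetical; asynchronous saving or sharded checkpoints would change the arithmetic.

```python
def checkpoint_tradeoff(interval_min, save_min):
    """Return (fraction of wall time spent saving, training minutes at risk).

    Assumes training runs for `interval_min` minutes, then blocks for
    `save_min` minutes while the checkpoint is written.
    """
    overhead = save_min / (interval_min + save_min)
    at_risk_min = interval_min  # work since the last completed checkpoint
    return overhead, at_risk_min


for interval in (10, 60, 360):
    overhead, at_risk = checkpoint_tradeoff(interval, save_min=2)
    print(f"every {interval:>3} min: {overhead:.1%} of time saving, "
          f"up to {at_risk} min of training at risk")
```

Shorter intervals cost more time and disk writes but cap the loss from a crash; longer intervals do the opposite.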

Key idea

A full checkpoint stores weights plus optimizer, schedule, and random number generator state so training resumes cleanly. Keep the best checkpoint by validation metric separate from the latest one for safe recovery and deployment.

Check yourself

1. Why must a resumable checkpoint store more than the model weights?

2. Why keep a best checkpoint separate from the latest one?