More than saving weights
A checkpoint captures the state needed to continue or reuse training. The weights alone let you run inference, but resuming a paused run needs more: the optimizer state, the epoch and step counters, the learning rate schedule position, and the random number generator (RNG) state.
Two purposes
- Best checkpoint: saved whenever a validation metric improves, and used for final deployment.
- Latest checkpoint: saved periodically so that a long run that crashes can resume without losing days of compute.
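The two purposes can be sketched as one small helper. This is a minimal illustration, not a library API: `CheckpointKeeper`, `save_every`, and the file names `latest.pkl` and `best.pkl` are hypothetical, and it assumes the validation metric is a loss where lower is better.

```python
import os
import pickle


class CheckpointKeeper:
    """Hypothetical helper that keeps a 'latest' and a 'best' checkpoint apart."""

    def __init__(self, directory, save_every):
        self.latest_path = os.path.join(directory, "latest.pkl")
        self.best_path = os.path.join(directory, "best.pkl")
        self.save_every = save_every
        self.best_metric = float("inf")  # assumption: lower metric is better

    def _save(self, state, path):
        with open(path, "wb") as f:
            pickle.dump(state, f)

    def update(self, step, state, val_metric=None):
        """Save what is due at this step and return the names of what was saved."""
        saved = []
        # Latest: written on a fixed schedule, for crash recovery.
        if step % self.save_every == 0:
            self._save(state, self.latest_path)
            saved.append("latest")
        # Best: written only when the validation metric improves.
        if val_metric is not None and val_metric < self.best_metric:
            self.best_metric = val_metric
            self._save(state, self.best_path)
            saved.append("best")
        return saved
```

Because the two files are written under different conditions, a later and worse model can overwrite `latest.pkl` without touching `best.pkl`.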
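What goes in
- Model weights, which are enough on their own for inference.
- Optimizer state, such as momentum buffers.
- Epoch and step counters.
- Learning rate schedule position.
- Random number generator state.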
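As a rough sketch, the pieces of a full checkpoint can be bundled into a single dictionary and written to disk. The example uses only the Python standard library so it runs anywhere; the function names and dictionary keys are illustrative assumptions, and a real training framework would supply its own weight and optimizer state objects in place of the plain values used here.

```python
import pickle
import random


def make_checkpoint(weights, optimizer_state, epoch, step, lr_schedule_step):
    """Bundle everything needed to resume training into one dictionary."""
    return {
        "weights": weights,
        "optimizer_state": optimizer_state,    # e.g. momentum buffers
        "epoch": epoch,
        "step": step,
        "lr_schedule_step": lr_schedule_step,  # position in the LR schedule
        "rng_state": random.getstate(),        # state of Python's built-in RNG
    }


def save_checkpoint(path, checkpoint):
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_checkpoint(path):
    """Load a checkpoint and restore the RNG to where it was when saved."""
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    random.setstate(checkpoint["rng_state"])
    return checkpoint
```

Restoring the RNG state, not just reusing the original seed, is what lets shuffling and other random draws continue from the point of the save instead of restarting.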
Practical discipline
- Keep the best by metric separately from the most recent, since the latest may be overfitting.
- Save the optimizer state. Otherwise the momentum buffers reset to zero and the loss can spike on resume.
- Version checkpoints and record the config so a saved model is reproducible.
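The optimizer-state point can be illustrated with a toy example. The sketch below assumes plain SGD with momentum on the one-dimensional loss 0.5 * w**2, with made-up values for the learning rate and momentum coefficient. It compares an uninterrupted run against two resumed runs: one that restores the momentum buffer and one that resets it to zero.

```python
def sgd_momentum_step(w, v, lr=0.1, mu=0.9):
    """One step of SGD with momentum on the toy loss 0.5 * w**2."""
    grad = w              # derivative of 0.5 * w**2 with respect to w
    v = mu * v + grad     # update the momentum buffer
    w = w - lr * v        # update the weight
    return w, v


def run(w, v, steps):
    for _ in range(steps):
        w, v = sgd_momentum_step(w, v)
    return w, v


# Uninterrupted run of 10 steps.
w_full, _ = run(1.0, 0.0, 10)

# Interrupted run: stop after 5 steps and "checkpoint" both w and v.
w_mid, v_mid = run(1.0, 0.0, 5)

# Resume with the saved momentum buffer, then with the buffer reset to zero.
w_resumed, _ = run(w_mid, v_mid, 5)
w_reset, _ = run(w_mid, 0.0, 5)

print("uninterrupted:        ", w_full)
print("resumed with momentum:", w_resumed)
print("resumed, buffer reset:", w_reset)
```

Restoring the buffer reproduces the uninterrupted trajectory exactly, while resetting it sends the weight down a different path. In a real model the same mismatch is what can show up as a jump in the loss right after resuming.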
Practical notes
- Checkpoint frequency trades disk and time against how much work a crash can cost.
- For huge models, consider saving only every few epochs to limit storage.
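The frequency trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement. It assumes that a crash is equally likely at any moment between two saves, so on average half an interval of work is lost, and that training pauses while a checkpoint is written. The function name and parameters are hypothetical.

```python
def checkpoint_tradeoff(interval_minutes, save_minutes):
    """Estimate the cost of a checkpoint interval.

    Returns (expected minutes of work lost per crash,
             fraction of wall-clock time spent saving).
    """
    # Assumption: crash time is uniform within the interval.
    expected_loss = interval_minutes / 2
    # Assumption: training is blocked while the checkpoint is written.
    overhead_fraction = save_minutes / (interval_minutes + save_minutes)
    return expected_loss, overhead_fraction


# Compare a few intervals for a checkpoint that takes 2 minutes to write.
for interval in (10, 60, 240):
    loss, overhead = checkpoint_tradeoff(interval, save_minutes=2)
    print(f"every {interval:>3} min: ~{loss:.0f} min lost per crash, "
          f"{overhead:.1%} of time spent saving")
```

Shorter intervals shrink the expected loss but raise the share of time spent saving, and they also produce more files to store unless older checkpoints are deleted.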
Key idea
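A full checkpoint stores the weights plus the optimizer state, the schedule position, the epoch and step counters, and the RNG state, so training resumes cleanly. Keep the best-by-validation checkpoint separate from the latest one: the best is for deployment, the latest is for crash recovery.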