Model Checkpointing

More than saving weights

A checkpoint captures the state needed to continue or reuse training. The weights alone let you run inference, but resuming a paused run needs more: the optimizer state, the epoch and step, the learning rate schedule position, and the random seed state.
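As a rough illustration, the sketch below bundles those pieces into one file using only the Python standard library. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the toy values are assumptions made for this example, not any framework's API; in a real framework the weights and optimizer entries would be that framework's own state objects. Writing to a temporary file and renaming it is an extra precaution added here so that a crash during the write cannot corrupt the previous checkpoint.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, weights, optimizer_state, epoch, step, lr_step):
    """Write everything needed to resume (hypothetical helper and key names)."""
    state = {
        "weights": weights,
        "optimizer_state": optimizer_state,
        "epoch": epoch,
        "step": step,
        "lr_schedule_step": lr_step,
        # Seed state, so shuffling and sampling continue where they left off.
        "rng_state": random.getstate(),
    }
    # Write to a temp file first, then rename over the target path.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read the checkpoint back and restore the random number generator."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng_state"])
    return state


ckpt_path = os.path.join(tempfile.mkdtemp(), "latest.pkl")

random.seed(0)
save_checkpoint(
    ckpt_path,
    weights=[0.1, -0.2],
    optimizer_state={"momentum": [0.0, 0.0]},
    epoch=3,
    step=1200,
    lr_step=1200,
)
expected_next = random.random()  # what the uninterrupted run would draw next

random.seed(999)  # simulate a fresh process with unrelated RNG state
restored = load_checkpoint(ckpt_path)
resumed_next = random.random()  # matches expected_next after restoring
```

Because the seed state is restored, the resumed process draws the same next random number the original run would have drawn, which is what makes data shuffling and dropout-style sampling continue consistently.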

Two purposes

Best checkpoint: saved whenever a validation metric improves, and used for final deployment.
Latest checkpoint: saved periodically so a crashed long run can resume without losing days of compute.
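The two policies can be sketched side by side. This is a simulation under stated assumptions: the function name `run_with_checkpoints`, the `save_every` parameter, and the list of validation losses are all made up for illustration, lower loss is assumed to be better, and checkpoints are kept in memory rather than written to disk.

```python
import copy


def run_with_checkpoints(val_losses, save_every=2):
    """Simulate a run; val_losses[i] is the validation loss after step i + 1.

    Returns the (best, latest) checkpoint dicts.
    """
    best = None
    latest = None
    weights = {"w": 0.0}
    for step, val_loss in enumerate(val_losses, start=1):
        weights["w"] += 1.0  # stand-in for a real training update
        snapshot = {
            "step": step,
            "val_loss": val_loss,
            "weights": copy.deepcopy(weights),
        }
        if best is None or val_loss < best["val_loss"]:
            best = snapshot  # "best": overwritten only when the metric improves
        if step % save_every == 0:
            latest = snapshot  # "latest": overwritten on a fixed schedule
    return best, latest


# Validation loss improves until step 3, then gets worse.
best, latest = run_with_checkpoints([0.9, 0.6, 0.5, 0.55, 0.7, 0.8])
print(best["step"], latest["step"])  # prints "3 6"
```

Here the best checkpoint stays at step 3 while the latest moves on to step 6, so the two serve different needs: one is the candidate for deployment, the other is the point to resume from.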

What goes in

Model weights, which are enough on their own for inference.
Optimizer state, such as momentum buffers.
Epoch and step counters.
Learning rate schedule position.
Random seed state.

Practical discipline

Keep the best by metric separately from the most recent, since the latest may be overfitting.
Save the optimizer state; otherwise the momentum buffers reset to zero and the loss can spike on resume.
Version checkpoints and record the config so a saved model is reproducible.
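The point about optimizer state can be checked on a toy problem. The sketch below assumes plain SGD with momentum on the one-dimensional loss f(w) = 0.5 * w**2; the function names, learning rate, and momentum coefficient are chosen only for illustration. It shows that resuming with the saved momentum buffer reproduces the uninterrupted run, while resuming from weights alone follows a different trajectory. It does not try to reproduce a loss spike, only the divergence that underlies one.

```python
def sgd_momentum_step(w, velocity, lr=0.1, beta=0.9):
    """One SGD-with-momentum step on f(w) = 0.5 * w**2, whose gradient is w."""
    grad = w
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity


def train(w, velocity, steps):
    """Run a fixed number of steps and return the final weight and velocity."""
    for _ in range(steps):
        w, velocity = sgd_momentum_step(w, velocity)
    return w, velocity


# Reference: 20 uninterrupted steps.
w_ref, _ = train(w=5.0, velocity=0.0, steps=20)

# Interrupted run: "checkpoint" after 10 steps.
w_mid, v_mid = train(w=5.0, velocity=0.0, steps=10)

# Resume with the saved momentum buffer: identical arithmetic, identical result.
w_full, _ = train(w_mid, v_mid, steps=10)

# Resume from weights only: the momentum buffer restarts at zero.
w_weights_only, _ = train(w_mid, 0.0, steps=10)

print(w_ref == w_full)          # True
print(w_ref == w_weights_only)  # False
```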

Practical notes

Checkpoint frequency trades disk and time against how much work a crash can cost.
For huge models, consider saving only every few epochs to limit storage.
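A back-of-the-envelope calculation makes the trade-off concrete. The function `checkpoint_cost` and the input numbers (a 2-minute save, 3 crashes) are hypothetical, and the estimate assumes crashes land uniformly within a checkpoint interval, so each one loses half an interval of work on average.

```python
def checkpoint_cost(interval_min, save_min, crashes):
    """Rough cost estimate for checkpointing every `interval_min` minutes.

    Returns (fraction of wall time spent saving, expected minutes of lost work).
    Assumes each crash loses half an interval on average.
    """
    save_overhead_fraction = save_min / (interval_min + save_min)
    expected_lost_min = crashes * interval_min / 2
    return save_overhead_fraction, expected_lost_min


for interval in (10, 60, 240):
    overhead, lost = checkpoint_cost(interval, save_min=2, crashes=3)
    print(f"every {interval:>3} min: {overhead:.1%} of time saving, "
          f"~{lost:.0f} min lost to crashes")
```

Shorter intervals spend a larger share of time writing checkpoints but lose less work per crash; longer intervals do the opposite. The same reasoning applies to disk usage when older checkpoints are retained.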

Key idea

A full checkpoint stores weights plus optimizer, schedule, and seed state so training resumes cleanly. Keep the best by validation separate from the latest for safe recovery and deployment.