Watching more than loss
A falling training loss does not prove that a model is improving on what matters. Evaluation during fine-tuning tracks the signals that reveal whether the model is genuinely getting better, overfitting, or regressing on abilities it already had.
What to monitor
- A held-out validation set, to catch overfitting: training loss keeps dropping while validation loss rises.
- Target task metrics, not just loss, since loss and task quality can diverge.
- Regression checks on general benchmarks to detect catastrophic forgetting.
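The three signals above can be turned into simple per-checkpoint checks. The following is a minimal sketch, not a prescribed method: the `EvalRecord` fields, the `diagnose` function, and the `forget_tolerance` threshold of 0.02 are all hypothetical names and values chosen for illustration, and both metrics are assumed to be "higher is better".

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluation snapshot taken at a given training step (hypothetical schema)."""
    step: int
    train_loss: float
    val_loss: float
    task_metric: float    # target task quality, higher is better (assumed)
    general_score: float  # broad benchmark average, higher is better (assumed)


def diagnose(prev: EvalRecord, curr: EvalRecord,
             baseline_general: float,
             forget_tolerance: float = 0.02) -> list[str]:
    """Compare two consecutive snapshots and return a list of warning flags."""
    flags = []
    # Overfitting: training loss still falls while validation loss rises.
    if curr.train_loss < prev.train_loss and curr.val_loss > prev.val_loss:
        flags.append("overfitting")
    # Loss and task quality diverge: validation loss improves, task metric drops.
    if curr.val_loss < prev.val_loss and curr.task_metric < prev.task_metric:
        flags.append("loss_metric_divergence")
    # Forgetting: general score fell below the pre-fine-tuning baseline
    # by more than the allowed tolerance.
    if baseline_general - curr.general_score > forget_tolerance:
        flags.append("forgetting")
    return flags


prev = EvalRecord(step=100, train_loss=1.0, val_loss=1.2,
                  task_metric=0.60, general_score=0.70)
curr = EvalRecord(step=200, train_loss=0.8, val_loss=1.3,
                  task_metric=0.61, general_score=0.65)
flags = diagnose(prev, curr, baseline_general=0.70)
```

Comparing only two consecutive snapshots is sensitive to noise; in practice one might compare smoothed values or require the condition to hold for several evaluations in a row.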
The loop
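One plausible shape for such a loop is sketched below: train, evaluate every few steps on held-out data, remember the best checkpoint by task metric, and stop once the metric has failed to improve for a set number of evaluations. This is an illustrative sketch under stated assumptions rather than any particular framework's API. `fine_tune_with_eval`, `train_step`, `evaluate`, `eval_every`, and `patience` are hypothetical names, and the toy callables at the bottom stand in for a real model and evaluation suite.

```python
from typing import Callable, Dict, List, Tuple


def fine_tune_with_eval(
    train_step: Callable[[int], float],            # runs one update, returns train loss
    evaluate: Callable[[int], Dict[str, float]],   # returns held-out metrics
    total_steps: int,
    eval_every: int,
    patience: int,
) -> Tuple[int, List[Dict[str, float]]]:
    """Train with periodic evaluation and patience-based early stopping.

    Returns the step of the best checkpoint and the full evaluation history.
    """
    history: List[Dict[str, float]] = []
    best_step, best_metric, stale = 0, float("-inf"), 0

    for step in range(1, total_steps + 1):
        train_loss = train_step(step)
        if step % eval_every != 0:
            continue

        record = {"step": step, "train_loss": train_loss, **evaluate(step)}
        history.append(record)

        if record["task_metric"] > best_metric:
            # New best: a real loop would save a checkpoint here.
            best_step, best_metric, stale = step, record["task_metric"], 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` evaluations in a row

    return best_step, history


# Toy stand-ins: training loss keeps falling, but the task metric
# peaks at step 300 and then declines, mimicking overfitting.
def toy_train_step(step: int) -> float:
    return 1.0 / step


def toy_evaluate(step: int) -> Dict[str, float]:
    return {"task_metric": 1.0 - abs(step - 300) / 1000}


best_step, history = fine_tune_with_eval(
    toy_train_step, toy_evaluate, total_steps=1000, eval_every=100, patience=2
)
print(best_step, len(history))  # 300 5
```

In the toy run, training loss is still falling when the loop stops at step 500, yet the checkpoint it selects is the one from step 300, where the held-out task metric peaked.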
Practical pitfalls
Fine-tuning datasets are often small, so validation scores can be noisy; averaging over several evaluations and running multiple seeds helps separate real gains from noise. Data leakage between the fine-tuning and evaluation sets inflates scores, so keep the two strictly separate. Because a model can improve on the target task while degrading elsewhere, a small, broad evaluation suite run alongside the target metric gives the full picture and guides early stopping or checkpoint selection.
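The pitfalls above can each be guarded against with a few lines of code. The sketch below is illustrative only: `find_leakage`, `summarize_seeds`, `select_checkpoint`, and the `max_drop` threshold are hypothetical names and values. The leakage check catches only exact matches after whitespace and case normalization, so near-duplicates or paraphrases would need a fuzzier method.

```python
import statistics
from typing import Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leakage(train_texts: List[str], eval_texts: List[str]) -> List[str]:
    """Return evaluation examples that also appear in the training set."""
    train_keys = {normalize(t) for t in train_texts}
    return [t for t in eval_texts if normalize(t) in train_keys]


def summarize_seeds(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across seeds."""
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread


def select_checkpoint(records: List[Dict[str, float]],
                      baseline_general: float,
                      max_drop: float = 0.02) -> Optional[Dict[str, float]]:
    """Pick the best task metric among checkpoints whose general score
    has not dropped more than `max_drop` below the baseline."""
    eligible = [r for r in records
                if baseline_general - r["general_score"] <= max_drop]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["task_metric"])


leaked = find_leakage(
    ["What is 2+2?", "Name a prime."],
    ["what is  2+2?", "Define entropy."],
)
mean, spread = summarize_seeds([0.70, 0.74, 0.72])
chosen = select_checkpoint(
    [
        {"step": 100, "task_metric": 0.6, "general_score": 0.70},
        {"step": 200, "task_metric": 0.7, "general_score": 0.69},
        {"step": 300, "task_metric": 0.8, "general_score": 0.60},
    ],
    baseline_general=0.70,
)
```

Here the step 300 checkpoint has the highest task metric but is excluded because its general score fell well below the baseline, so the step 200 checkpoint is chosen instead. A difference between two checkpoints that is smaller than the spread across seeds is a weak basis for preferring one over the other.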
Key idea
Evaluation during fine-tuning watches held-out task metrics and regression checks, not just training loss, to catch overfitting and forgetting and to pick the best checkpoint.