The Eval During Fine Tuning

Watching more than loss

Training loss falling does not prove a model is improving on what matters. Evaluation during fine tuning tracks the signals that reveal whether the model is genuinely getting better, overfitting, or regressing on prior abilities.

What to monitor

A held out validation set, to catch overfitting: training loss keeps dropping while validation loss rises.
Target task metrics, not just loss, since loss and task quality can diverge.
Regression checks on general benchmarks to detect catastrophic forgetting.
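
The first and third signals above can be turned into simple checks. The sketch below is only illustrative: the function names, the window size, the tolerance, and the benchmark names in the usage note are all assumptions, not part of any particular library.

```python
from typing import Dict, List


def is_overfitting(train_losses: List[float], val_losses: List[float],
                   window: int = 3) -> bool:
    """Return True when, over the last `window` evaluations, training loss
    fell while validation loss rose (the classic overfitting signature)."""
    if len(train_losses) < window or len(val_losses) < window:
        return False  # not enough history to judge
    t, v = train_losses[-window:], val_losses[-window:]
    return t[-1] < t[0] and v[-1] > v[0]


def forgotten_benchmarks(baseline: Dict[str, float], current: Dict[str, float],
                         tolerance: float = 0.01) -> List[str]:
    """Return the benchmarks whose score dropped by more than `tolerance`
    relative to the scores measured before fine tuning (higher is better)."""
    return sorted(name for name, base in baseline.items()
                  if base - current[name] > tolerance)
```

For example, with hypothetical benchmark names, `forgotten_benchmarks({"qa": 0.70, "math": 0.50}, {"qa": 0.60, "math": 0.50}, tolerance=0.05)` flags only `"qa"`. The tolerance should be chosen with the noise of each benchmark in mind.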

The loop
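
One way such a loop might look is sketched below. It is framework agnostic and makes several assumptions: `train_step` and `evaluate` are hypothetical callables you would supply, metrics are "higher is better", early stopping is driven by validation loss with a patience counter, and the checkpoint is chosen by target metric among evaluations that have not regressed too far on the general suite. None of these choices comes from a specific library.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EvalRecord:
    step: int
    train_loss: float
    val_loss: float
    task_metric: float     # target task score, higher is better
    general_metric: float  # broad regression suite score, higher is better


def should_stop(history: List[EvalRecord], patience: int = 3) -> bool:
    """Stop when the last `patience` evaluations did not beat the best
    validation loss seen before them."""
    if len(history) <= patience:
        return False
    best_before = min(r.val_loss for r in history[:-patience])
    recent_best = min(r.val_loss for r in history[-patience:])
    return recent_best >= best_before


def select_checkpoint(history: List[EvalRecord], baseline_general: float,
                      max_regression: float = 0.02) -> Optional[EvalRecord]:
    """Pick the evaluation with the best task metric among those whose
    general metric stayed within `max_regression` of the baseline."""
    ok = [r for r in history
          if r.general_metric >= baseline_general - max_regression]
    return max(ok, key=lambda r: r.task_metric) if ok else None


def fine_tune_with_eval(
    train_step: Callable[[int], float],                  # returns training loss
    evaluate: Callable[[int], Tuple[float, float, float]],  # val loss, task, general
    max_steps: int,
    eval_every: int,
    baseline_general: float,
    patience: int = 3,
) -> Tuple[List[EvalRecord], Optional[EvalRecord]]:
    history: List[EvalRecord] = []
    for step in range(1, max_steps + 1):
        train_loss = train_step(step)
        if step % eval_every == 0:
            val_loss, task_metric, general_metric = evaluate(step)
            history.append(EvalRecord(step, train_loss, val_loss,
                                      task_metric, general_metric))
            if should_stop(history, patience):
                break
    return history, select_checkpoint(history, baseline_general)
```

In a real run, `evaluate` would also save a checkpoint at each evaluation step so that the selected record can be mapped back to saved weights. Note that the selected checkpoint is not necessarily the one with the highest target score: a later step may score higher on the target but be excluded for regressing on the general suite.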

Practical pitfalls

Fine tuning sets are small, so validation scores can be noisy; averaging results over multiple seeds helps. Data leakage between the fine tuning and eval sets inflates scores, so keep them strictly separate. Because a model can improve on the target task while degrading elsewhere, a small, broad eval suite run alongside the target metric gives the full picture and guides early stopping or checkpoint selection.
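
Two of these pitfalls lend themselves to quick checks, sketched here under simplifying assumptions. The helper names are hypothetical, and the leakage check only catches examples that match exactly after lowercasing and whitespace normalization; paraphrased or partially overlapping examples would need a fuzzier comparison.

```python
import statistics
from typing import List, Tuple


def summarize_seeds(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaks(train_examples: List[str], eval_examples: List[str]) -> List[str]:
    """Return eval examples that also appear (after normalization) in the
    fine tuning set."""
    train_set = {normalize(t) for t in train_examples}
    return [e for e in eval_examples if normalize(e) in train_set]
```

Reporting the standard deviation next to the mean makes it easier to tell whether a difference between two checkpoints is larger than the seed to seed noise.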

Key idea

Evaluation during fine tuning watches held out task metrics and regression checks, not just training loss, to catch overfitting and forgetting and to pick the best checkpoint.
