Apples to apples
To claim model A beats model B you must compare them fairly. A difference caused by an uneven setup is not a real improvement. Control everything except the change under test.
- Use the same train, validation, and test splits.
- Give each model the same tuning budget, rather than giving one a head start.
- Evaluate with the same metric and preprocessing.
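The checklist above can be sketched in code. This is a minimal illustration, not a prescribed implementation: `make_splits`, `tune`, `train_and_score`, and `sample_config` are hypothetical names, and random search is only an assumed tuning method. The point is that the splits come from one fixed seed and every candidate gets the same number of tuning trials, scored on validation data only.

```python
import random


def make_splits(n_examples, seed=0, val_frac=0.1, test_frac=0.1):
    """Shuffle indices once with a fixed seed so every model sees the same splits."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    n_test = int(n_examples * test_frac)
    n_val = int(n_examples * val_frac)
    return {
        "test": idx[:n_test],
        "val": idx[n_test:n_test + n_val],
        "train": idx[n_test + n_val:],
    }


def tune(train_and_score, sample_config, budget, seed=0):
    """Random search with exactly `budget` trials.

    `train_and_score(cfg)` is assumed to train on the train split and return
    a validation score (higher is better). The test split is never touched here.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = sample_config(rng)
        score = train_and_score(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In use, both candidates would call `make_splits` with the same seed and `tune` with the same `budget`, and only the single best configuration of each would be scored on the test split.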
Noise and significance
A single test score varies with the random seed and with which examples happen to land in the test set. A small gap may be noise, especially on a small test set.
- Run multiple seeds and report mean and spread.
- Use a significance check or confidence interval on the gap.
- Do not tune either model on the test set. That is a form of leakage, and it inflates the tuned model's score.
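The checks above can be sketched with the standard library alone. The function names are hypothetical, and the paired percentile bootstrap is one assumed choice of interval among several reasonable ones. The sketch assumes test examples are independent and that model A and model B were run with the same list of seeds, so their scores can be paired.

```python
import math
import random
import statistics


def accuracy_stderr(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def summarize(scores):
    """Mean and sample standard deviation across seeds (needs at least 2 scores)."""
    return statistics.mean(scores), statistics.stdev(scores)


def bootstrap_gap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(A - B), pairing runs by seed."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired: one per seed for each model")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    # Resample the per-seed differences with replacement and record each mean.
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

For example, `accuracy_stderr(0.9, 1000)` is about 0.0095, so on a 1,000-example test set a gap of half a point sits well inside one standard error. If the interval returned by `bootstrap_gap_ci` includes zero, the runs do not support calling either model the winner.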
A fair protocol
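Fix the splits, metric, preprocessing, and tuning budget up front, then check the gap against seed variance. Only then does a win mean something.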
Key idea
Fair model comparison fixes splits, metric, and tuning budget across candidates and checks the gap against seed variance, so the reported winner reflects a real difference rather than setup luck.