Machine Learning

The Code Generation Eval

Grading generated programs by running them, not by reading them.

Execution beats text matching

Comparing generated code to a reference string fails because many correct programs look nothing alike: they can differ in variable names, control flow, or library calls and still compute the same result. Code generation evals instead run each candidate against tests. A candidate that passes all the hidden tests counts as correct, regardless of style.
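A minimal sketch of the difference, assuming each candidate defines a function named `solution` (a hypothetical convention chosen for this example) and that tests are given as (arguments, expected result) pairs. The helper `passes_tests` is illustrative only, and the bare `exec` call is not safe for untrusted code.

```python
def passes_tests(source, tests, func_name="solution"):
    """Execute `source`, then check the named function against (args, expected) pairs."""
    namespace = {}
    try:
        # NOTE: exec is used for illustration only; real harnesses isolate this step.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Any crash (syntax error, missing function, runtime error) counts as a failure.
        return False


reference = "def solution(xs):\n    return sum(xs)\n"

# A correct program that shares almost no text with the reference.
candidate = (
    "def solution(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total\n"
)

hidden_tests = [(([1, 2, 3],), 6), (([],), 0), (([-5, 5],), 0)]

string_match = candidate == reference                    # text comparison rejects it
execution_match = passes_tests(candidate, hidden_tests)  # execution accepts it
```

Here the text comparison rejects a correct program, while the execution check accepts it because it only looks at behavior on the tests.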

The pass at k metric

A model may need several tries to get it right. Pass at k (often written pass@k) is the probability that at least one of k sampled solutions passes all the tests. Pass at one reflects single-shot reliability, while higher k credits models that succeed when allowed multiple attempts.
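One common way to estimate pass at k is to draw n samples per problem (with n at least k), count the c samples that pass, and compute the chance that a random subset of k samples contains at least one passing solution: 1 - C(n - c, k) / C(n, k). The lesson does not prescribe this estimator, so the sketch below is an assumption about how the number might be computed; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass at k from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made only of failing samples.
    # It is 0 when fewer than k samples failed, which gives an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples drawn for one problem, 3 of them passed.
single_shot = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
```

With 3 passing samples out of 10, the single-attempt estimate is 0.3, while allowing five attempts raises it to about 0.92. Benchmark scores are typically the average of this value over all problems.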

Building a fair harness

  • Sandboxed execution so untrusted code cannot harm the host.
  • Comprehensive tests, including edge cases, not just the happy path.
  • Time and memory limits to stop infinite loops and runaway memory use.
  • Functional checks plus optional runtime and complexity measures.
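The checklist above can be partly sketched with the standard library. The example below runs a candidate and its tests in a separate Python process with a wall-clock limit. The function name `run_candidate` and the three status strings are assumptions made for this sketch. A child process with a timeout is not a real sandbox: it does not restrict file, network, or memory access, so a production harness would add container or VM isolation and resource limits on top of this.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, test_code: str, timeout_s: float = 5.0) -> str:
    """Run candidate plus tests in a child process; return 'passed', 'failed', or 'timeout'."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_test.py"
        script.write_text(code + "\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,  # the child is killed when the limit is exceeded
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    # A failing assert or any uncaught exception gives a non-zero exit code.
    return "passed" if result.returncode == 0 else "failed"


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
good = run_candidate("def add(a, b):\n    return a + b", tests)
bad = run_candidate("def add(a, b):\n    return a - b", tests)
stuck = run_candidate("while True:\n    pass", tests, timeout_s=1.0)
```

The three calls illustrate the outcomes a harness has to tell apart: a correct solution, a wrong one, and one that never terminates.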

Limits and traps

Weak tests pass buggy code, so coverage matters. Popular problem sets leak into training data, inflating scores. Passing tests is not the same as being readable, secure, or efficient, so production evals add static analysis and security scanning. Execution-based scoring is powerful but only as honest as the test suite behind it.
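As a small illustration of the coverage point, the sketch below scores an intentionally buggy median function against two test suites. The names `buggy_median`, `passes`, `weak_tests`, and `strong_tests` are all made up for this example.

```python
def buggy_median(xs):
    """Intentionally buggy: correct only for odd-length lists."""
    ordered = sorted(xs)
    return ordered[len(ordered) // 2]


def passes(func, tests):
    """Return True only if func gives the expected result on every (args, expected) pair."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Happy path only: every input has an odd number of elements.
weak_tests = [(([3, 1, 2],), 2), (([5],), 5)]

# Adds the even-length edge case, where the median is the mean of the two middle values.
strong_tests = weak_tests + [(([1, 2, 3, 4],), 2.5), (([4, 1],), 2.5)]

weak_verdict = passes(buggy_median, weak_tests)
strong_verdict = passes(buggy_median, strong_tests)
```

The weak suite marks the buggy function as correct, and only the suite with even-length inputs exposes the bug. The same model would score higher on the weaker benchmark without being any better.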

Key idea

Code generation is scored by execution against hidden tests using pass at k, which is robust to surface differences but only as trustworthy as test coverage and freedom from data contamination.

Check yourself

1. Why is execution preferred over comparing generated code to a reference string?

2. What does pass at k measure?

3. What limits the trustworthiness of an execution-based code eval?