The Code Generation Eval

Execution beats text matching

Comparing generated code to a reference string fails because correct programs can look nothing alike: two solutions may differ in naming, structure, and even algorithm and still both be right. Code generation evaluation therefore runs the generated code against tests. If it passes the hidden tests, it counts as correct, regardless of style.
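As a minimal sketch of the difference, the snippet below compares a candidate to a reference both ways. The function name `solution`, the helper `passes_tests`, and the tiny test list are all hypothetical, chosen only for illustration. The candidate is run with a bare `exec` to keep the example short; a real harness would isolate it.

```python
# Two correct programs with different text (both are illustrative examples).
reference_source = (
    "def solution(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total\n"
)
candidate_source = "def solution(xs):\n    return sum(xs)\n"

# Hypothetical hidden tests: (arguments, expected result).
TESTS = [
    (([1, 2, 3],), 6),
    (([],), 0),
    (([-1, 1],), 0),
]


def passes_tests(source, tests):
    """Define `solution` from source text and check it against every test."""
    namespace = {}
    exec(source, namespace)  # unsafe outside a sandbox; fine for a toy example
    fn = namespace["solution"]
    return all(fn(*args) == expected for args, expected in tests)


exact_match = candidate_source == reference_source
functionally_correct = passes_tests(candidate_source, TESTS)

print("exact match:", exact_match)
print("passes tests:", functionally_correct)
```

The string comparison rejects the candidate, while the execution check accepts it, which is the behavior the paragraph above argues for.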

The pass@k metric

A model may need several tries to produce a correct solution. Pass at k, usually written pass@k, is the probability that at least one of k sampled solutions passes the tests. pass@1 reflects single-shot reliability, while higher k rewards models that can succeed when allowed multiple attempts.
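The text does not say how this probability is estimated. One commonly used approach, assumed here rather than taken from this lesson, is to draw n samples per problem (with n at least k), count the c samples that pass, and compute 1 - C(n - c, k) / C(n, k), the chance that a random size-k subset of the n samples contains at least one passing solution. The function and parameter names below are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of samples drawn (assumed n >= k)
    c: number of those samples that passed the tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n):
        raise ValueError("c must be between 0 and n")
    if not (1 <= k <= n):
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the size-k subsets made only of failing samples;
    # it is 0 when fewer than k samples failed, giving an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 of them pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With 3 passing samples out of 10, the estimate for k = 1 is about 0.3 and rises to about 0.917 for k = 5. A benchmark score is then typically the mean of this value over all problems.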

Building a fair harness

A fair harness needs four things:

Sandboxed execution, so untrusted code cannot harm the host.
Comprehensive tests that include edge cases, not just the happy path.
Time and memory limits, to stop infinite loops and runaway memory use.
Functional checks, plus optional runtime and complexity measures.
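The sketch below illustrates only the time-limit and functional-check parts of such a harness. It runs each candidate in a separate Python process with a wall-clock timeout. The helper name `run_candidate` and the sample programs are hypothetical. A child process is not a sandbox: real isolation and memory limits would need containers, virtual machines, or operating-system resource limits, which this example leaves out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_source, test_source, timeout_s=10.0):
    """Return 'pass', 'fail', or 'timeout' for one candidate program."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        # The tests are plain assert statements appended after the candidate.
        script.write_text(candidate_source + "\n" + test_source)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    # A failed assert or any uncaught exception gives a nonzero exit code.
    return "pass" if proc.returncode == 0 else "fail"


# Illustrative candidates for a toy `add` problem.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
looping = "def add(a, b):\n    while True:\n        pass\n"

if __name__ == "__main__":
    print(run_candidate(good, tests))
    print(run_candidate(wrong, tests))
    print(run_candidate(looping, tests, timeout_s=1.0))
```

Reporting the timeout as its own outcome, separate from a failed assertion, keeps hangs distinguishable from wrong answers when results are reviewed.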

Limits and traps

Weak tests pass buggy code, so coverage matters. Popular problem sets leak into training data, which inflates scores. Passing tests is not the same as being readable, secure, or efficient, so production evals add static analysis and security scanning. Execution-based scoring is powerful, but it is only as honest as the test suite behind it.
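To make the first trap concrete, here is a small illustration with made-up names and data: a buggy maximum function that passes a happy-path test suite and is caught only when an edge case is added.

```python
def buggy_max(xs):
    """Intended to return the largest element of a non-empty list."""
    best = 0  # bug: starts from 0, so all-negative inputs are handled wrongly
    for x in xs:
        if x > best:
            best = x
    return best


def passes(fn, tests):
    """True if fn returns the expected value for every (input, expected) pair."""
    return all(fn(xs) == expected for xs, expected in tests)


# Happy-path tests only: positive numbers.
weak_tests = [([1, 2, 3], 3), ([5, 4], 5)]

# The same tests plus one edge case: every element is negative.
strong_tests = weak_tests + [([-3, -1, -2], -1)]

print("weak suite passes:", passes(buggy_max, weak_tests))
print("strong suite passes:", passes(buggy_max, strong_tests))
```

Under the weak suite this candidate would be scored as correct, so any pass rate computed from it would overstate the model's ability.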

Key idea

Code generation is scored by executing candidates against hidden tests and reporting pass at k (pass@k). The score is robust to surface differences, but it is only as trustworthy as the tests' coverage and the benchmark's freedom from data contamination.

