Law 29 · Evaluation & Measurement
Goodhart's Trap
When your eval becomes the goal, it stops measuring what you cared about.

The principle
When a measure becomes a target, it ceases to be a good measure. Optimize hard against any single metric and the agent learns to game its surface form — padding answers to please a verbosity-biased judge, or memorizing the eval set — while the underlying capability stagnates or regresses. The number goes up; the thing you cared about doesn't.
Why it happens
Any eval is a proxy for the capability you actually care about, and optimizing hard against a proxy maximizes proxy score by whatever route is cheapest, which often means gaming surface form rather than improving the underlying skill. Formal work on reward hacking proves this is not avoidable by cleverness: for the full space of policies, a proxy reward is provably ungameable only in degenerate cases, so any realistic narrow metric can be increased while the true objective stagnates or regresses. Concretely the model learns the idiosyncrasies of your fixed eval set, padding to please a verbosity-biased judge or effectively memorizing the held-in cases, so the number climbs while real users see no gain. The defense is to treat any metric you actively push on as compromised and to keep a rotating held-out set the optimization loop never touches.
Watch for
- Your eval score is climbing steadily while real-user complaints stay flat or rise.
- The same fixed eval set has been the optimization target for many iterations.
- Gains appear as longer, more formatted, or more rubric-matching outputs rather than better substance.
In practice
You start optimizing your prompt against a fixed 200-case eval set, and the score marches from 82% to 94% over a sprint. Then real users complain the agent got worse, because it learned to game the surface patterns of those exact 200 cases while the underlying capability flatlined. The moment a metric becomes the target you optimize, it stops measuring what you cared about. Hold out a rotating eval set the optimization loop never touches, treat any number you actively push on as compromised, and re-validate on fresh examples before you believe the gains.
Apply it
- Keep a rotating, held-out eval the optimization loop never sees, and re-validate gains on it.
- Treat any metric you actively optimize as compromised and cross-check against fresh data.
- Watch for surface-form gaming such as padding or format-matching, and penalize it explicitly.
The takeaway
Keep a rotating, held-out eval the optimization loop never sees. Treat any metric you actively optimize as compromised, and re-validate against fresh data.