Law 29 · Evaluation & Measurement

Goodhart's Trap

When your eval becomes the goal, it stops measuring what you cared about.

Diagram explaining Goodhart's Trap

The principle

When a measure becomes a target, it ceases to be a good measure. Optimize hard against any single metric and the agent learns to game its surface form — padding answers to please a verbosity-biased judge, or memorizing the eval set — while the underlying capability stagnates or regresses. The number goes up; the thing you cared about doesn't.

Why it happens

Any eval is a proxy for the capability you actually care about. Optimizing hard against a proxy raises the proxy score by whatever route is cheapest, which often means gaming surface form rather than improving the underlying skill. Formal work on reward hacking shows this cannot be engineered away by cleverness: over the full space of policies, a proxy reward is provably ungameable only in degenerate cases, so any realistic narrow metric can be pushed up while the true objective stagnates or regresses. Concretely, the model learns the idiosyncrasies of your fixed eval set, padding answers to please a verbosity-biased judge or effectively memorizing the cases it is tuned on, so the number climbs while real users see no gain. The defense is to treat any metric you actively push on as compromised and to keep a rotating held-out set the optimization loop never touches.
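A toy sketch can make the "cheapest route" point concrete. Everything here is an assumption made for illustration: the `Answer` type, the scoring weights, and the two candidate edits are invented, and the judge is a deliberately crude stand-in for a verbosity-biased grader, not a model of any real one.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Answer:
    facts_correct: int  # the capability we actually care about
    words: int          # surface form the judge can also see


def true_score(a: Answer) -> int:
    return a.facts_correct


def proxy_score(a: Answer) -> int:
    # Hypothetical verbosity-biased judge: 10 points per correct fact,
    # plus 1 point for every 5 words of length.
    return 10 * a.facts_correct + a.words // 5


def candidate_edits(a: Answer) -> list[Answer]:
    return [
        # A real improvement: one more correct fact, 20 more words.
        replace(a, facts_correct=a.facts_correct + 1, words=a.words + 20),
        # Pure padding: 100 more words, no new facts.
        replace(a, words=a.words + 100),
    ]


def hill_climb(start: Answer, steps: int) -> Answer:
    # Greedy optimizer that only ever sees the proxy.
    current = start
    for _ in range(steps):
        current = max(candidate_edits(current), key=proxy_score)
    return current


start = Answer(facts_correct=3, words=100)
end = hill_climb(start, steps=5)
print("proxy:", proxy_score(start), "->", proxy_score(end))
print("true: ", true_score(start), "->", true_score(end))
```

Under these made-up weights, padding is worth more to the judge than a new fact at every step, so the optimizer pads five times in a row: the proxy triples while the true score never moves.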

In practice

You start optimizing your prompt against a fixed 200-case eval set, and the score marches from 82% to 94% over a sprint. Then real users complain the agent got worse, because it learned to game the surface patterns of those exact 200 cases while the underlying capability flatlined. The moment a metric becomes the target you optimize, it stops measuring what you cared about. Hold out a rotating eval set the optimization loop never touches, treat any number you actively push on as compromised, and re-validate on fresh examples before you believe the gains.
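One way to put "re-validate before you believe the gains" into code is to ask how much of the gain on the tuned set shows up on fresh examples. This is a minimal sketch: the function name `gain_holds_up` and the `min_ratio` threshold of 0.5 are assumptions chosen for illustration, and the fresh-set numbers are invented to mirror the scenario above, not measured results.

```python
def gain_holds_up(tuned_before: float, tuned_after: float,
                  fresh_before: float, fresh_after: float,
                  min_ratio: float = 0.5) -> bool:
    """Return True if fresh data reproduces at least min_ratio of the tuned-set gain."""
    tuned_gain = tuned_after - tuned_before
    fresh_gain = fresh_after - fresh_before
    if tuned_gain <= 0:
        return False  # no gain on the tuned set, so nothing to validate
    return fresh_gain >= min_ratio * tuned_gain


# Tuned set: the 82% -> 94% climb from the scenario above.
# Fresh set: hypothetical numbers where the score slips slightly.
gamed = gain_holds_up(0.82, 0.94, fresh_before=0.80, fresh_after=0.79)

# Hypothetical healthy case: most of the gain carries over to fresh data.
real = gain_holds_up(0.82, 0.94, fresh_before=0.80, fresh_after=0.90)

print(gamed, real)
```

The threshold is a judgment call. A stricter ratio catches more gaming but also rejects genuine gains that happen to be noisier on a small fresh sample.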

Apply it

  1. Keep a rotating, held-out eval the optimization loop never sees, and re-validate gains on it.
  2. Treat any metric you actively optimize as compromised and cross-check against fresh data.
  3. Watch for surface-form gaming such as padding or format-matching, and penalize it explicitly.
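The three steps above could be sketched roughly as follows. The `RotatingHoldout` class, the `length_penalized` function, the word budget, and the penalty rate are all hypothetical names and values chosen for illustration. The sketch assumes you have a reserve of cases the optimization loop has never seen, and it retires each slice after one use, on the reasoning that a slice that has informed a decision is no longer fresh.

```python
from collections import deque


class RotatingHoldout:
    """Hands out slices of never-before-seen cases, one slice per validation round."""

    def __init__(self, fresh_case_ids):
        self._reserve = deque(fresh_case_ids)  # cases nobody has looked at yet
        self._retired = []                     # cases already used for validation

    def next_slice(self, size):
        if size > len(self._reserve):
            raise ValueError("reserve exhausted: collect new cases before re-validating")
        picked = [self._reserve.popleft() for _ in range(size)]
        self._retired.extend(picked)  # never reissued as held-out
        return picked

    def remaining(self):
        return len(self._reserve)


def length_penalized(score, words, word_budget=150, penalty_per_word=0.001):
    """Subtract a penalty for every word over budget so padding cannot buy score."""
    overflow = max(0, words - word_budget)
    return score - penalty_per_word * overflow


holdout = RotatingHoldout(f"case-{i}" for i in range(60))
round_1 = holdout.next_slice(20)  # validate this sprint's changes
round_2 = holdout.next_slice(20)  # next sprint gets a different, untouched slice

print(len(round_1), len(round_2), holdout.remaining())
print(length_penalized(0.9, words=350))  # padded answer loses score
print(length_penalized(0.9, words=100))  # answer within budget is unchanged
```

Raising an error when the reserve runs dry is deliberate: silently recycling old cases would reintroduce exactly the contamination the rotation is meant to prevent.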

The takeaway

Keep a rotating, held-out eval the optimization loop never sees. Treat any metric you actively optimize as compromised, and re-validate against fresh data.
