Law 30 · Evaluation & Measurement

Regress or Repeat

Every fixed bug is a future regression unless it becomes a test.

The principle

LLM systems are non-deterministic and globally coupled: a prompt tweak that fixes one case can silently break others. Re-running real production examples against the new prompt is the only dependable way to know you did not break what already worked. Without a regression suite you are trapped in a whack-a-mole loop, rediscovering the same failures release after release.

Why it happens

The same prompt can yield different outputs across runs even at temperature zero, because batching and floating-point execution on parallel hardware are not bit-reproducible. One study across five models and eight tasks found accuracy varying by up to 15% across nominally identical runs. On top of that run-to-run variance, the prompt is a single shared control surface, so a change that fixes one case routinely shifts behavior on unrelated cases that share the same instructions. Together these mean you cannot reason your way to confidence that a fix is safe; you have to re-run the real prior cases and observe. Capturing every fixed bug as a permanent case in a regression suite is what makes that re-run routine.
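
One way to see the run-to-run variance described above is to send the same input repeatedly and measure how often the most common answer appears. The sketch below is illustrative only: `agreement_rate` and `make_flaky_model` are hypothetical names, and the "model" is a seeded random stand-in rather than a real LLM call.

```python
import random
from collections import Counter
from typing import Callable

Model = Callable[[str], str]


def agreement_rate(model: Model, prompt: str, runs: int = 20) -> float:
    """Return the fraction of runs that produced the most common output."""
    outputs = [model(prompt) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def make_flaky_model(seed: int, flip_rate: float) -> Model:
    """Hypothetical stand-in for an LLM whose answer occasionally flips."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        return "deny" if rng.random() < flip_rate else "approve"

    return model


def stable_model(prompt: str) -> str:
    """Hypothetical stand-in that always answers the same way."""
    return "approve"


flaky_model = make_flaky_model(seed=0, flip_rate=0.2)

print(agreement_rate(stable_model, "Refund my $1,200 order"))  # 1.0
print(agreement_rate(flaky_model, "Refund my $1,200 order"))
```

An agreement rate below 1.0 on an unchanged prompt is a sign that a single passing run says little about whether a fix holds.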

In practice

A user reports that the agent mishandles refunds over $1,000. You tweak the prompt, confirm that the one case works, and ship. In the next release the same refund bug is back, and the prompt change has also quietly broken partial refunds, because you never re-ran the old cases. Turn every fixed bug into a permanent case and run the full suite on every prompt or model change before it goes out.
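
A minimal sketch of turning the refund bug into a permanent case, assuming the suite lives in a JSONL file. The `RegressionCase` fields, the file name, the ticket label, and the expected value `escalate_to_human` are all hypothetical; a real suite would use whatever schema and expected outputs fit the system.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class RegressionCase:
    case_id: str     # stable identifier, never reused
    user_input: str  # the real input that triggered the bug
    expected: str    # the output the fixed system should produce
    origin: str      # bug report or ticket the case came from


def load_cases(suite_path: Path) -> list[RegressionCase]:
    """Read every case from the JSONL suite; a missing file is an empty suite."""
    if not suite_path.exists():
        return []
    with suite_path.open(encoding="utf-8") as f:
        return [RegressionCase(**json.loads(line)) for line in f if line.strip()]


def add_case(suite_path: Path, case: RegressionCase) -> None:
    """Append a case to the suite, refusing to overwrite an existing id."""
    if any(c.case_id == case.case_id for c in load_cases(suite_path)):
        raise ValueError(f"duplicate case id: {case.case_id}")
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "regression_suite.jsonl"  # hypothetical file name
    add_case(suite, RegressionCase(
        case_id="refund-over-1000",
        user_input="Refund my $1,200 order",
        expected="escalate_to_human",  # hypothetical expected output
        origin="user report (hypothetical ticket)",
    ))
    print(len(load_cases(suite)))  # 1
```

The suite is append-only here on purpose: a case that is never deleted keeps guarding against the bug it came from.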

Apply it

  1. Turn every fixed bug into a permanent regression case with its expected output.
  2. Run the full regression suite on every prompt and model change before shipping.
  3. Because outputs vary run to run, evaluate over repeated runs rather than trusting a single pass.
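
The three steps above can be sketched as a small release gate: run every stored case several times, compute a per-case pass rate, and block the change if any case falls below a threshold. Everything here is an assumption made for illustration: the function names, the exact-match grading, the threshold of 1.0, and the stand-in `candidate_model`, which deliberately regresses on partial refunds. Real suites often need a looser grader than string equality.

```python
from typing import Callable

Model = Callable[[str], str]
Case = tuple[str, str, str]  # (case_id, user_input, expected_output)


def pass_rates(model: Model, cases: list[Case], runs: int = 5) -> dict[str, float]:
    """Run each case `runs` times and return the fraction of passing runs."""
    rates: dict[str, float] = {}
    for case_id, user_input, expected in cases:
        passes = sum(model(user_input).strip() == expected for _ in range(runs))
        rates[case_id] = passes / runs
    return rates


def failing_cases(rates: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Return the ids of cases whose pass rate is below the threshold."""
    return sorted(case_id for case_id, rate in rates.items() if rate < threshold)


def candidate_model(user_input: str) -> str:
    """Hypothetical new prompt: fixes large refunds, breaks partial refunds."""
    if "$1,200" in user_input:
        return "escalate_to_human"
    return "approve_full_refund"


CASES: list[Case] = [
    ("refund-over-1000", "Refund my $1,200 order", "escalate_to_human"),
    ("partial-refund", "I want a partial refund of $40", "approve_partial_refund"),
]

rates = pass_rates(candidate_model, CASES, runs=5)
blocked = failing_cases(rates)
print(rates)  # {'refund-over-1000': 1.0, 'partial-refund': 0.0}
print("ship" if not blocked else f"blocked by: {blocked}")  # blocked by: ['partial-refund']
```

Reporting a pass rate per case, rather than a single pass or fail, keeps a flaky case visible instead of letting one lucky run hide it.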

The takeaway

Every failure you fix becomes a permanent case in your regression eval. Run the full suite on every prompt or model change before shipping.
