Law 30 · Evaluation & Measurement

Regress or Repeat

Every fixed bug is a future regression unless it becomes a test.

The principle

LLM systems are non-deterministic and globally coupled: a prompt tweak that fixes one case can silently break three others. Re-running real production examples against the new prompt is the only reliable way to know you didn't break what already worked. Without a regression suite you are trapped in a whack-a-mole loop, rediscovering the same failures release after release.

Why it happens

Two properties drive this. First, non-determinism: the same prompt can yield different outputs across runs even at temperature zero, because batching and floating-point execution on parallel hardware are not bit-reproducible. A study across five models and eight tasks found accuracy varying by up to 15% across nominally identical runs. Second, global coupling: the prompt is a single shared control surface, so a change that fixes one case routinely shifts behavior on unrelated cases that share the same instructions. Together these mean you cannot reason your way to confidence that a fix is safe; you have to re-run the real prior cases and observe. That only works if every fixed bug has been captured as a permanent case in a regression suite.
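To make the run-to-run variance concrete, here is a minimal sketch that scores the same cases several times and reports the lowest and highest accuracy seen. The `make_flaky_model` stub, its `flip_rate` parameter, and the case labels are all hypothetical stand-ins for a real model call; the randomness is simulated, not measured.

```python
import random
from typing import Callable

# Hypothetical case format: (input text, expected label).
Case = tuple[str, str]


def accuracy(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the model output equals the expected label."""
    hits = sum(1 for text, expected in cases if model(text) == expected)
    return hits / len(cases)


def accuracy_spread(
    model: Callable[[str], str], cases: list[Case], n_runs: int
) -> tuple[float, float]:
    """Score the same cases n_runs times; return (lowest, highest) accuracy."""
    scores = [accuracy(model, cases) for _ in range(n_runs)]
    return min(scores), max(scores)


def make_flaky_model(seed: int, flip_rate: float) -> Callable[[str], str]:
    """Simulated stand-in for an LLM: returns the wrong label with probability flip_rate."""
    rng = random.Random(seed)

    def model(text: str) -> str:
        correct = "refund" if "refund" in text else "other"
        if rng.random() < flip_rate:
            return "other" if correct == "refund" else "refund"
        return correct

    return model


cases = [
    ("refund over $1,000", "refund"),
    ("partial refund", "refund"),
    ("change address", "other"),
    ("reset password", "other"),
]
low, high = accuracy_spread(make_flaky_model(seed=0, flip_rate=0.2), cases, n_runs=20)
print(f"accuracy ranged from {low:.2f} to {high:.2f} across 20 identical runs")
```

If the spread is wide, a single passing run says little about whether a change is safe.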

Watch for

A bug you fixed last release has reappeared because nobody re-ran the old case.
A prompt tweak aimed at one case silently broke a different, unrelated case.
You ship prompt or model changes without re-running the previously passing examples.
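One way to catch the second symptom is to compare per-case results before and after a change. The sketch below assumes you already have a pass/fail result per case id for the old and new prompt; the function name and the case ids are illustrative, not from any particular tool.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed under the baseline but fail, or are missing, under the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical results: the tweak fixed the target case but broke another one.
baseline = {"refund-over-1000": False, "partial-refund": True, "address-change": True}
candidate = {"refund-over-1000": True, "partial-refund": False, "address-change": True}

print(find_regressions(baseline, candidate))  # ['partial-refund']
```

Treating a missing case as a failure is a deliberate choice here: a case that silently drops out of the run should not count as passing.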

In practice

A user reports that the agent mishandles refunds over $1,000. You tweak the prompt, confirm that one case works, and ship. In the next release the same refund bug is back, and the prompt change has also quietly broken partial refunds, because nobody re-ran the old cases. Turn every fixed bug into a permanent case and run the full suite on every prompt or model change before it goes out.
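A sketch of what "permanent case" could mean in practice: append each fixed bug to a file that lives in version control. The JSON Lines layout, the field names, and the expected labels below are assumptions for illustration; the scenario above does not specify what the correct refund behavior is.

```python
import json
import tempfile
from pathlib import Path


def load_suite(suite_path: Path) -> list[dict]:
    """Read all cases from a JSON Lines file; a missing file is an empty suite."""
    if not suite_path.exists():
        return []
    with suite_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def add_regression_case(suite_path: Path, case_id: str, user_input: str, expected: str) -> None:
    """Append one fixed bug to the suite file, refusing duplicate ids."""
    if any(case["id"] == case_id for case in load_suite(suite_path)):
        raise ValueError(f"case {case_id!r} already exists")
    record = {"id": case_id, "input": user_input, "expected": expected}
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# Usage with a temporary file; a real suite would be committed to the repo.
with tempfile.TemporaryDirectory() as tmp:
    suite_file = Path(tmp) / "regressions.jsonl"
    # "escalate" and "partial_refund" are hypothetical expected outputs.
    add_regression_case(suite_file, "refund-over-1000", "Refund my $1,500 order", "escalate")
    add_regression_case(suite_file, "partial-refund", "Refund one item of three", "partial_refund")
    print([case["id"] for case in load_suite(suite_file)])
```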

Apply it

Turn every fixed bug into a permanent regression case with its expected output.
Run the full regression suite on every prompt and model change before shipping.
Because outputs vary run to run, evaluate over repeated runs rather than trusting a single pass.
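The three steps above can be sketched as a small gate that runs every case several times and blocks the release if any case falls below a required pass rate. This is a minimal sketch under stated assumptions: `stub_model` stands in for your real prompt plus model call, exact string match stands in for whatever grading you actually use, and the default `n_runs` and `min_pass_rate` values are arbitrary.

```python
from typing import Callable


def pass_rate(model: Callable[[str], str], user_input: str, expected: str, n_runs: int) -> float:
    """Fraction of n_runs in which the model output exactly matches the expected output."""
    passes = sum(1 for _ in range(n_runs) if model(user_input) == expected)
    return passes / n_runs


def run_suite(
    model: Callable[[str], str],
    suite: list[dict],
    n_runs: int = 5,
    min_pass_rate: float = 1.0,
) -> dict[str, float]:
    """Return {case id: pass rate} for every case that falls below min_pass_rate."""
    failures: dict[str, float] = {}
    for case in suite:
        rate = pass_rate(model, case["input"], case["expected"], n_runs)
        if rate < min_pass_rate:
            failures[case["id"]] = rate
    return failures


def stub_model(text: str) -> str:
    # Hypothetical stand-in for the candidate prompt: handles large refunds
    # but has lost the partial-refund behavior.
    return "escalate" if "over $1,000" in text else "auto_approve"


suite = [
    {"id": "refund-over-1000", "input": "I want a refund of over $1,000", "expected": "escalate"},
    {"id": "partial-refund", "input": "Refund one item of three", "expected": "partial_refund"},
]

failures = run_suite(stub_model, suite)
safe_to_ship = not failures
print(failures, safe_to_ship)
```

Requiring a pass rate of 1.0 is the strictest setting; a lower threshold with more runs may suit cases where some variance is acceptable.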

The takeaway

Every failure you fix becomes a permanent case in your regression eval. Run the full suite on every prompt or model change before shipping.
