Law 26 · Evaluation & Measurement

Vibes Don't Scale

Eyeballing outputs feels like progress until you can't tell if a change helped.

The principle

The common root cause of failed LLM products is the absence of robust evals: teams ship on vibe checks, iterate blindly, and can't measure whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software — the up-front cost that makes every later change cheap and safe.

Why it happens

Manual spot-checking is unmeasured and unrepeatable, so it cannot tell you whether a prompt change improved anything or just changed something, and it collapses the moment a second engineer or a tenth example enters the picture. Generic off-the-shelf metrics do not rescue you either: practitioners report that n-gram and embedding-similarity scores prove unreliable or impractical for real tasks, which is why task-specific, re-runnable assertions are the unit that actually holds. The core asymmetry is the same as unit tests in software: the up-front cost of an eval harness is what makes every later change cheap, safe, and comparable. Without that harness you are iterating blind, and blind iteration on a non-deterministic system tends to trade one failure for another you never see.

Watch for

Prompt changes are judged by eyeballing a few outputs in a playground and nodding.
Nobody can state whether last week's change actually helped, only that it felt better.
A second person tweaks the prompt and silently regresses cases nobody re-checked.

In practice

Your team iterates on the summarization prompt by eyeballing a few outputs in the playground, nodding, and shipping. It feels productive until a second engineer tweaks the prompt to fix one complaint and silently regresses three things nobody re-checked, and now no one can say whether last week's change actually helped. Vibe checks do not survive a second person or a tenth example. Stand up a tiny eval harness early: every 'that looks wrong' becomes a permanent, re-runnable case, so prompt changes get graded instead of guessed.

Apply it

Stand up a small re-runnable eval set before scaling, and run it on every prompt or model change.
Turn every that looks wrong moment into a permanent test case with an expected outcome.
Prefer task-specific checks over generic similarity scores, since the latter often fail to track real quality.

The takeaway

Build a small eval harness before you scale. Turn every 'that looks wrong' moment into a permanent, re-runnable test case.

Sources and further reading

Read every law in the digital edition Back to all 50 laws

The principle

Why it happens

Watch for

Apply it

Sources and further reading

Related laws