Law 26 · Evaluation & Measurement
Vibes Don't Scale
Eyeballing outputs feels like progress until you can't tell if a change helped.

The principle
The common root cause of failed LLM products is the absence of robust evals: teams ship on vibe checks, iterate blindly, and can't measure whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software — the up-front cost that makes every later change cheap and safe.
Why it happens
Manual spot-checking is unmeasured and unrepeatable, so it cannot tell you whether a prompt change improved anything or just changed something, and it collapses the moment a second engineer or a tenth example enters the picture. Generic off-the-shelf metrics do not rescue you either: practitioners report that n-gram and embedding-similarity scores prove unreliable or impractical for real tasks, which is why task-specific, re-runnable assertions are the unit that actually holds. The core asymmetry is the same as unit tests in software: the up-front cost of an eval harness is what makes every later change cheap, safe, and comparable. Without that harness you are iterating blind, and blind iteration on a non-deterministic system tends to trade one failure for another you never see.
Watch for
- Prompt changes are judged by eyeballing a few outputs in a playground and nodding.
- Nobody can state whether last week's change actually helped, only that it felt better.
- A second person tweaks the prompt and silently regresses cases nobody re-checked.
In practice
Your team iterates on the summarization prompt by eyeballing a few outputs in the playground, nodding, and shipping. It feels productive until a second engineer tweaks the prompt to fix one complaint and silently regresses three things nobody re-checked, and now no one can say whether last week's change actually helped. Vibe checks do not survive a second person or a tenth example. Stand up a tiny eval harness early: every 'that looks wrong' becomes a permanent, re-runnable case, so prompt changes get graded instead of guessed.
Apply it
- Stand up a small re-runnable eval set before scaling, and run it on every prompt or model change.
- Turn every that looks wrong moment into a permanent test case with an expected outcome.
- Prefer task-specific checks over generic similarity scores, since the latter often fail to track real quality.
The takeaway
Build a small eval harness before you scale. Turn every 'that looks wrong' moment into a permanent, re-runnable test case.