Law 26 · Evaluation & Measurement

Vibes Don't Scale

Eyeballing outputs feels like progress until you can't tell if a change helped.

The principle

The common root cause of failed LLM products is the absence of robust evals: teams ship on vibe checks, iterate blindly, and can't measure whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software: the up-front cost that makes every later change cheap and safe.

Why it happens

Manual spot-checking is unmeasured and unrepeatable, so it cannot tell you whether a prompt change improved anything or just changed something, and it collapses the moment a second engineer or a tenth example enters the picture. Generic off-the-shelf metrics do not rescue you either: practitioners report that n-gram and embedding-similarity scores prove unreliable or impractical for real tasks, which is why task-specific, re-runnable assertions are the unit that actually holds. The core asymmetry is the same as unit tests in software: the up-front cost of an eval harness is what makes every later change cheap, safe, and comparable. Without that harness you are iterating blind, and blind iteration on a non-deterministic system tends to trade one failure for another you never see.
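To make the gap between generic scores and task-specific assertions concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the refund example, the `unigram_jaccard` helper (a deliberately crude stand-in for n-gram overlap metrics, not any library's implementation), and the `passes_task_checks` rules. It shows how a summary that flips the key fact can out-score a correct but terser one on word overlap, while a simple assertion catches the error.

```python
import re


def tokens(text):
    """Lowercase word/number tokens as a set (illustrative tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_jaccard(candidate, reference):
    """Crude word-overlap score: shared tokens / all tokens (assumed stand-in metric)."""
    a, b = tokens(candidate), tokens(reference)
    return len(a & b) / len(a | b)


def passes_task_checks(summary):
    """Hypothetical task-specific assertions for this one refund example."""
    words = tokens(summary)
    return "1234" in words and "approved" in words and "denied" not in words


reference = "The refund was approved for order 1234 on May 3."
good = "Order 1234: refund approved May 3."              # terse but correct
bad = "The refund was denied for order 1234 on May 3."   # fluent but wrong

for label, summary in [("good", good), ("bad", bad)]:
    score = unigram_jaccard(summary, reference)
    print(label, round(score, 2), passes_task_checks(summary))
```

In this toy case the wrong summary shares more words with the reference than the correct one does, so the overlap score ranks them backwards; the assertion does not.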

In practice

Your team iterates on the summarization prompt by eyeballing a few outputs in the playground, nodding, and shipping. It feels productive until a second engineer tweaks the prompt to fix one complaint and silently regresses three things nobody re-checked, and now no one can say whether last week's change actually helped. Vibe checks do not survive a second person or a tenth example. Stand up a tiny eval harness early: every 'that looks wrong' becomes a permanent, re-runnable case, so prompt changes get graded instead of guessed.
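One lightweight way to make a 'that looks wrong' moment permanent is to append it to a file that lives next to the prompt. The sketch below assumes a JSON Lines file and a hypothetical case shape (`input`, `must_contain`, `note`); neither the file name nor the field names come from the original text.

```python
import json
from pathlib import Path


def record_case(path, input_text, must_contain, note):
    """Append one regression case to a JSON Lines file (hypothetical schema)."""
    case = {"input": input_text, "must_contain": must_contain, "note": note}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def load_cases(path):
    """Read every recorded case back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call such as `record_case("summarize_cases.jsonl", ticket_text, ["1234"], "summary dropped the order id")` takes seconds, and the case is then re-checked on every later change instead of relying on someone remembering it.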

Apply it

  1. Stand up a small re-runnable eval set before scaling, and run it on every prompt or model change.
  2. Turn every 'that looks wrong' moment into a permanent test case with an expected outcome.
  3. Prefer task-specific checks over generic similarity scores, since the latter often fail to track real quality.
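The steps above can be sketched as a very small harness. This is one possible shape under stated assumptions, not a prescribed design: `EvalCase`, `run_evals`, the three checks, and the two `summarize_*` functions are all hypothetical, and the summarizers are plain stubs standing in for real model calls behind two prompt versions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # task-specific assertion on the output


TICKET = "Refund approved for order 1234 on May 3."

CASES = [
    EvalCase("keeps_order_id", TICKET, lambda out: "1234" in out),
    EvalCase("under_12_words", TICKET, lambda out: len(out.split()) <= 12),
    EvalCase("no_invented_denial", TICKET, lambda out: "denied" not in out.lower()),
]


def run_evals(generate, cases=CASES):
    """Run every case through `generate` and report which checks failed."""
    failures = [c.name for c in cases if not c.check(generate(c.input_text))]
    return {
        "passed": len(cases) - len(failures),
        "total": len(cases),
        "failures": failures,
    }


# Stubs standing in for two prompt versions (a real harness would call the model).
def summarize_v1(text):
    return text


def summarize_v2(text):
    return "Order refund approved."  # shorter, but drops the order id


for name, fn in [("v1", summarize_v1), ("v2", summarize_v2)]:
    print(name, run_evals(fn))
```

Running both versions through the same cases turns "v2 looks tighter" into a graded comparison: the stubbed v2 passes the length check but fails `keeps_order_id`, which is exactly the kind of silent regression a spot check tends to miss.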

The takeaway

Build a small eval harness before you scale. Turn every 'that looks wrong' moment into a permanent, re-runnable test case.
