Law 27 · Evaluation & Measurement

Look at Your Data

The highest-ROI activity in AI is the one teams skip first.

The principle

Error analysis, meaning manually reading your app's actual traces to find where it fails, is the single most valuable activity in AI development. Teams still skip it in favor of dashboards and vanity metrics that improve while users keep struggling. You cannot write a good eval for a failure mode you have never seen, and you only see failure modes by reading traces.

Why it happens

Dashboard aggregates hide failure modes: an average score says nothing about what went wrong in any particular conversation, so the only way to see your real failures is to read actual production traces. The structured version of this is error analysis: read a sample of traces, write open-ended notes on what went wrong, then cluster those notes into recurring failure categories that become your eval targets. Research on this loop surfaced criteria drift: the act of grading outputs is what reveals the criteria, so you cannot fully specify what to measure before you have looked at outputs. This is why vanity dashboards can climb while users still churn. The metric was chosen before anyone understood the failures, so it measures the wrong thing.

Watch for

A helpfulness or quality dashboard is climbing while user complaints or churn are not improving.
Your eval categories were defined before anyone read a single real transcript.
Nobody on the team can name the top three concrete ways the system actually fails in production.

In practice

Instead of reading transcripts, a team buys an eval platform and watches a 'helpfulness score' dashboard climb while users keep churning. The dashboard improves; the product does not, because nobody has read the actual traces and learned that the agent confidently invents return policies. No eval targets that failure, because no one on the team has witnessed it. Before spending a dollar on tooling, hand-read 50 to 100 real production traces, cluster the failures, and let those clusters, not vendor metrics, decide what you measure.
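Pulling the 50 to 100 traces for hand review can be as simple as a seeded random sample, so the batch is unbiased and anyone can reproduce it. The sketch below is a minimal illustration, not a prescribed method: the function name `sample_traces_for_review` and the `trace-0000` style IDs are hypothetical, and it assumes you can list trace IDs from your own logging store.

```python
import random


def sample_traces_for_review(trace_ids, k=50, seed=0):
    """Return up to k trace IDs chosen uniformly at random, reproducibly."""
    ids = list(trace_ids)
    if len(ids) <= k:
        # Fewer traces than requested: review all of them.
        return ids
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(ids, k)


# Hypothetical IDs; in practice these would come from your own trace store.
all_ids = [f"trace-{i:04d}" for i in range(1000)]
review_batch = sample_traces_for_review(all_ids, k=50, seed=27)
print(len(review_batch), "traces selected for hand review")
```

Sampling at random, instead of picking the traces attached to complaints, keeps the batch representative of what users actually see. A fixed seed lets a teammate pull the same batch and compare notes.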

Apply it

Hand-read a sample of real traces, jotting open notes on each failure before counting anything.
Cluster those notes into recurring failure categories and let the clusters define what you measure.
Expect your criteria to shift as you read, and revise the eval set instead of freezing it too early.
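The three steps above can be sketched in a few lines once the reading is done. This is a hedged illustration under stated assumptions: the notes, trace IDs, and category names are invented for the example, the category labels are assigned by a human after reading, and `rank_failure_categories` and `merge_categories` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical review log: (trace_id, open-ended note, category assigned while clustering).
# A category of None means the reviewer found no failure in that trace.
reviewed = [
    ("trace-0007", "Invented a 90-day return window", "fabricated policy"),
    ("trace-0031", "Quoted a refund rule that does not exist", "fabricated policy"),
    ("trace-0112", "Ignored the order number the user gave", "missed context"),
    ("trace-0140", "Answer was fine", None),
    ("trace-0203", "Made up a restocking fee", "fabricated policy"),
]


def rank_failure_categories(rows):
    """Count failures per category, most frequent first."""
    counts = Counter(cat for _, _, cat in rows if cat is not None)
    return counts.most_common()


def merge_categories(rows, renames):
    """Relabel categories, for example after the criteria shift during reading."""
    return [(tid, note, renames.get(cat, cat)) for tid, note, cat in rows]


ranking = rank_failure_categories(reviewed)
print(ranking)  # [('fabricated policy', 3), ('missed context', 1)]

# Criteria drift in miniature: a category gets renamed once more traces are read.
revised = merge_categories(reviewed, {"missed context": "ignored user input"})
print(rank_failure_categories(revised))
```

Keeping the raw note next to the category means a cluster can be renamed or merged later without rereading every trace, which is what lets the eval set be revised instead of frozen.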

The takeaway

Before buying an eval platform, hand-read 50–100 real traces and cluster the failures. Let those clusters define what you measure.
