Law 02 · Context & Reliability
Compounding Error Law
Reliability multiplies, it doesn't add.

The principle
A step that's 95% reliable, run ten times in sequence, produces a correct result only about 60% of the time. The failures don't announce themselves; they accumulate quietly until the final answer is wrong and you can't tell which step broke it. Every link you add lowers the ceiling of the whole chain.
Why it happens
Reliability multiplies because the steps run in series: the chain succeeds only if every step succeeds, so if failures are roughly independent the per-step probabilities multiply, and 0.95 to the tenth power is roughly 0.60. Each stage also consumes the previous stage's output, so one wrong intermediate result is silently carried forward as a true premise. The deeper cause for long agent runs is context contamination: Cognition's analysis argues that every step injects implicit decisions and conflicting assumptions that accumulate until the trajectory diverges, which is why naive retries that append the failed context make things worse rather than better. METR's 2025 measurements make the ceiling concrete: frontier agents reach near 100% success on tasks that take humans a few minutes but drop below 10% on tasks of several hours, in large part because long horizons mean more sequential steps and more chances for one to break the chain. The practical fix is structural, not a smarter model: shorten the chain, raise per-step reliability, and checkpoint to a verified-good state so errors cannot silently propagate.
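The arithmetic behind the 60% figure can be checked with a minimal Python sketch. The function name `chain_reliability` is illustrative, and the calculation assumes each step fails independently of the others, which real pipelines only approximate.

```python
import math


def chain_reliability(step_reliabilities):
    """End-to-end success probability of a serial chain.

    Assumes step failures are independent (an assumption for illustration),
    so the chain succeeds only when every step does.
    """
    return math.prod(step_reliabilities)


# Ten sequential steps, each 95% reliable.
ten_steps = chain_reliability([0.95] * 10)
print(f"10 steps at 95%: {ten_steps:.3f}")  # 0.599

# The ceiling keeps falling as the chain grows.
for n in (1, 5, 10, 20):
    print(n, round(chain_reliability([0.95] * n), 3))
```

Mixed reliabilities work the same way: pass the measured per-step values in a list and the product is the ceiling for the whole chain.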
Watch for
- End-to-end success is far worse than the per-step accuracy you measured in isolation.
- Final outputs are wrong but no single step looks obviously broken when you inspect it.
- Adding more pipeline stages keeps lowering overall reliability even as each stage tests fine.
In practice
A six-step invoice pipeline (OCR, extract line items, match vendor, validate totals, post to ledger, notify) tests at 95% per step, so you ship it. Then roughly a quarter of invoices come out subtly wrong with no obvious culprit. The errors are multiplicative, not additive: 0.95 to the sixth is about 0.74, which leaves about 26% of invoices with at least one failed step. Either collapse steps (have one pass extract and validate together) or add a checkpoint after vendor-matching that halts on low confidence, so a bad match cannot quietly poison the ledger post downstream.
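A checkpoint after vendor-matching might look like the sketch below. Everything named here is hypothetical: the `VendorMatch` record, the `confidence` score, the `CheckpointFailed` exception, and the 0.90 threshold are stand-ins for whatever the real matching stage produces, not part of any particular library.

```python
from dataclasses import dataclass


class CheckpointFailed(Exception):
    """Raised when a stage's output is not trusted enough to pass downstream."""


@dataclass
class VendorMatch:
    invoice_id: str
    vendor_id: str
    confidence: float  # hypothetical score in [0, 1] from the matching stage


def checkpoint_vendor_match(match: VendorMatch, min_confidence: float = 0.90) -> VendorMatch:
    """Pass the match through only if it clears the (assumed) threshold."""
    if match.confidence < min_confidence:
        # Halt here so a doubtful match never reaches the ledger post.
        raise CheckpointFailed(
            f"invoice {match.invoice_id}: vendor match confidence "
            f"{match.confidence:.2f} is below {min_confidence:.2f}"
        )
    return match


# Usage: a doubtful match is routed to review instead of being posted.
try:
    checkpoint_vendor_match(VendorMatch("INV-001", "V-42", confidence=0.61))
except CheckpointFailed as err:
    print(f"halted: {err}")
```

The design choice is that the checkpoint raises rather than returning a flag, so a caller that forgets to check cannot accidentally carry the bad match forward.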
Apply it
- Count the sequential steps and multiply their reliabilities to get the real end-to-end ceiling.
- Collapse independent steps into one pass, or raise per-step reliability, before adding new stages.
- Insert a validation checkpoint after pivotal steps that halts or restarts from the last good state on low confidence.
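To decide between shortening the chain and raising per-step reliability, it can help to invert the multiplication. The sketch below is illustrative only: the function names are made up for this example, and both formulas assume equal, independent per-step reliabilities.

```python
import math


def required_step_reliability(target: float, n_steps: int) -> float:
    """Per-step reliability needed for n equal steps to hit an end-to-end target."""
    if not 0.0 < target <= 1.0 or n_steps < 1:
        raise ValueError("target must be in (0, 1] and n_steps must be >= 1")
    return target ** (1.0 / n_steps)


def max_steps(step_reliability: float, target: float) -> int:
    """Largest number of equal steps that still meets the end-to-end target."""
    if not 0.0 < step_reliability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(step_reliability))


# A six-step chain that must be right 95% of the time end to end:
print(f"{required_step_reliability(0.95, 6):.4f}")  # 0.9915

# How many 95%-reliable steps fit under a 90% end-to-end target?
print(max_steps(0.95, 0.90))  # 2
```

Read this way, a 95% per-step score leaves room for only two sequential steps under a 90% target, which is why collapsing steps is often cheaper than adding stages.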
The takeaway
Count your steps. Shorten the chain, raise per-step reliability, and checkpoint between stages so a single bad step can't silently poison everything downstream.