Law 05 · Context & Reliability

Design for the Worst Case

Plan around the ceiling, not the average.

The principle

When a system says 'up to 24 hours', 'may retry', or 'no guaranteed latency', those bounds are the numbers that matter. Designing around the typical case works right up until the tail event — which is precisely when failure is most expensive. Failures aren't edge cases; at scale they're the steady state.

Why it happens

At scale the tail is not rare, it is the steady state, because every request rolls the dice against the full latency distribution and a system handling millions of calls hits the 99.9th percentile constantly. Dean and Barroso's The Tail at Scale quantifies this: in a Google service the 99th-percentile latency for a single request was 10ms, but waiting on all of a fan-out's requests pushed the 99th percentile to 140ms, and the slowest 5% of requests accounted for half of that tail. The same logic governs any up to 24 hours or may retry bound: those words define the worst plausible run, and at volume that run will happen. Designing dedup windows, timeouts, and retry budgets around the typical case works right up until the tail event, which is exactly when failure is most expensive, so you size against the ceiling instead.

Watch for

Timeouts, dedup windows, or retry budgets are set to the typical latency rather than the documented maximum.
Failures cluster at peak load or month-end, exactly when the system is most exercised.
A spec says up to X or may and the design quietly assumed the average instead.

In practice

The webhook docs say delivery may be retried for up to 24 hours and you build assuming events arrive once, within seconds, so your dedup window is 5 minutes and your timeout is 10 seconds. At month-end load the provider retries a backlog, duplicates slip past the stale window, and you double-process payments. Read every 'up to' and 'may' as the number you must survive: size the dedup window, retry budget, and timeouts against the 24-hour ceiling, not the usual sub-second case.

Apply it

Read every up to and may as the number you must survive, and do the math against that ceiling.
Size timeouts, dedup windows, and retry budgets for the worst plausible run, not the common one.
Load-test at the tail and the peak, since at scale the rare path becomes the routine one.

The takeaway

Whenever you're handed a maximum or a 'may', do the math against the ceiling. Size timeouts, retry budgets, and SLAs for the worst plausible run, not the one you usually see.

Sources and further reading

Read every law in the digital edition Back to all 50 laws

The principle

Why it happens

Watch for

Apply it

Sources and further reading

Related laws