Law 41 · Architecture & Operations

Trip the Breaker

Stop calling the thing that's already failing.

The principle

A downstream model or tool that's timing out doesn't get healthier by being called more — it gets worse, while your agents pile up holding open connections and burning latency budget. A circuit breaker wraps the call so that once failures cross a threshold it trips: further calls fail fast instead of hanging, giving the dependency room to recover.

Why it happens

A dependency that is timing out does not recover by being called more; the extra load deepens its overload while callers pile up holding open connections and draining their own latency budget, which is the mechanism of a cascading failure. A circuit breaker, the pattern popularized by Michael Nygard in Release It! and later documented by Martin Fowler, wraps the call in a small state machine with three states. Closed, it passes calls through and counts failures; once failures cross a threshold it opens, so further calls fail fast instead of hanging; after a cooldown it goes half-open and lets a probe request test recovery, closing again if the probe succeeds and reopening if it fails. Failing fast is the point: it converts an indefinite hang into a predictable, immediate error the agent can degrade against, and it gives the sick dependency room to recover instead of being hammered. Google's SRE guidance pairs this with shedding retries and traffic upstream once total load exceeds capacity, because uncontrolled retries are a primary driver of the cascade the breaker exists to stop.
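The state machine described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `cooldown_s` are assumptions made for this example, it counts consecutive failures rather than a failure rate, and a real deployment would need a lock around the shared state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock                       # injectable so tests can fake time
        self._failures = 0                        # consecutive failures seen
        self._opened_at: Optional[float] = None   # None means the breaker is closed
        self._probing = False                     # True while a half-open probe runs

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        # Fail fast while open, and while another probe is already in flight.
        if state == "open" or (state == "half-open" and self._probing):
            raise CircuitOpenError("dependency unavailable; failing fast")
        if state == "half-open":
            self._probing = True                  # this call is the single probe
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        self._probing = False

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed probe, or crossing the threshold, (re)opens and restarts the cooldown.
        if self._probing or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
        self._probing = False
```

A caller would wrap each outbound request as `breaker.call(lambda: client.embed(text))`, where `client.embed` stands in for whatever dependency is being protected, and treat `CircuitOpenError` as the signal to degrade.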

In practice

A downstream embedding service starts timing out, and your agents respond by hammering it harder on every retry, piling up open connections and driving the whole run's latency through the roof while the sick dependency gets sicker. Calling a failing service more never heals it. Wrap that dependency in a circuit breaker: once failures cross a threshold it trips and calls fail fast instead of hanging, then it periodically probes for recovery. Your agents degrade gracefully on a known error path instead of stalling indefinitely behind a dependency that is not coming back.
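What "degrade gracefully on a known error path" might look like for the embedding case is sketched below. Everything here is hypothetical: `CircuitOpenError` stands for whatever exception your breaker raises when it rejects a call, `guarded_embed` is the breaker-wrapped embedding call, and falling back to a stale cached vector is just one possible degradation; skipping the step or returning a default would be equally valid.

```python
from typing import Callable, Dict, List, Optional


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised by a breaker when it rejects a call."""


def embed_or_degrade(text: str,
                     guarded_embed: Callable[[str], List[float]],
                     cache: Dict[str, List[float]]) -> Optional[List[float]]:
    """Return a fresh embedding, or a degraded answer if the breaker is open."""
    try:
        vector = guarded_embed(text)
    except CircuitOpenError:
        # Known error path: serve a previously cached vector if we have one,
        # otherwise return None so the caller can skip the embedding step.
        return cache.get(text)
    cache[text] = vector  # remember the fresh result for future fallbacks
    return vector
```

The point of the sketch is that the agent's fallback is decided in one place and returns immediately, rather than each caller waiting out its own timeout.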

Apply it

  1. Wrap every external model and tool dependency in a breaker that opens after a failure threshold and fails fast.
  2. After a cooldown, let a single probe test recovery before resuming full traffic.
  3. Shed retries and traffic upstream when load exceeds capacity so retries do not amplify the cascade.
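Step 3 can be approximated with a retry budget, one common way to keep retries from multiplying load. The sketch below is an assumption-laden illustration, not a prescribed method: the `RetryBudget` name, the `ratio` and `max_tokens` parameters, and their defaults are all invented for this example.

```python
class RetryBudget:
    """Hypothetical budget: retries may not exceed a fixed ratio of requests."""

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0) -> None:
        self.ratio = ratio            # retry tokens earned per first-attempt request
        self.max_tokens = max_tokens  # cap so idle periods cannot bank a retry storm
        self._tokens = 0.0

    def record_request(self) -> None:
        """Call once per first-attempt request; earns a fraction of a retry."""
        self._tokens = min(self.max_tokens, self._tokens + self.ratio)

    def try_spend_retry(self) -> bool:
        """Return True if a retry is allowed right now, consuming one token."""
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # budget exhausted: shed the retry instead of sending it
```

With `ratio=0.1`, roughly one retry is permitted per ten original requests, so a dependency that starts failing sees at most about 10% extra traffic from retries instead of a multiple of its normal load.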

The takeaway

Wrap every external model and tool dependency in a circuit breaker that fails fast after a failure threshold, then probes for recovery — don't let a sick dependency drag the whole run down.
