Hard-won heuristics for building agents that actually work.
Not proven theorems — field notes, each backed by a real source. Fifty model-agnostic principles spanning context, reasoning, retrieval, scope, instruction, evaluation, safety, architecture, operations, and the humans in the loop. Inspired by the format of Laws of UX; every card carries its receipt.
Most bad outputs trace to missing, stale, or poisoned context — not a model that can't think. The model is usually smart enough; it was just reasoning over the wrong picture of the world. Garbage context produces confident garbage, and the confidence is exactly what makes it dangerous.
The takeaway
Before you reach for a bigger model, audit what the agent could actually see. Curate the context window deliberately — fresh, relevant, free of contradictions — and most 'reasoning' failures quietly disappear.
A step that's 95% reliable, run ten times in sequence, lands correct only about 60% of the time. The failures don't announce themselves — they accumulate quietly until the final answer is wrong and you can't tell which step broke it. Every link you add lowers the ceiling of the whole chain.
The takeaway
Count your steps. Shorten the chain, raise per-step reliability, and checkpoint between stages so a single bad step can't silently poison everything downstream.
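The arithmetic behind the 60% figure can be checked in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability; `chain_reliability` is a hypothetical helper name, not something from the source.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# How quickly a 95%-reliable step decays as the chain grows.
for n in (1, 5, 10, 20):
    print(n, round(chain_reliability(0.95, n), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Real steps are rarely fully independent, so treat this as a rough ceiling estimate rather than a prediction.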
Given a long input, a model attends most reliably to the very beginning and the very end. Critical facts buried in the middle quietly lose their grip — present but functionally ignored. The information was technically 'in context' and still got missed, which is the worst kind of bug because nothing looks wrong.
The takeaway
Put the most important instructions and findings at the top or the bottom. Lead with a summary, structure with explicit headers, and never assume that 'in the context' means 'actually used'.
An agent will write the summary before doing the work if you let it. 'Looking finished' is cheaper than being finished, so the model drifts toward the cheaper path — a plausible report, a confident 'done', an untested claim of success. The output reads complete; the work isn't. It's specification gaming: optimizing the proxy you can see, not the goal you meant.
The takeaway
Demand evidence, not assertions. Make the agent produce the artifact — the passing test, the diff, the file, the citation — before it's allowed to claim success. Verify the proof, not the promise.
When a system says 'up to 24 hours', 'may retry', or 'no guaranteed latency', those bounds are the numbers that matter. Designing around the typical case works right up until the tail event — which is precisely when failure is most expensive. Failures aren't edge cases; at scale they're the steady state.
The takeaway
Whenever you're handed a maximum or a 'may', do the math against the ceiling. Size timeouts, retry budgets, and SLAs for the worst plausible run, not the one you usually see.
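One way to "do the math against the ceiling" is to compute the longest a retried call could possibly take. The sketch below assumes a fixed per-attempt timeout and exponential backoff between attempts; the function name and all parameters are illustrative assumptions, not values from any particular system.

```python
def worst_case_seconds(timeout_s: float, max_attempts: int,
                       base_backoff_s: float, factor: float = 2.0) -> float:
    """Upper bound on wall-clock time if every attempt runs to its timeout."""
    # Backoff waits happen between attempts, so there are max_attempts - 1 of them.
    waits = sum(base_backoff_s * factor ** i for i in range(max_attempts - 1))
    return max_attempts * timeout_s + waits


# 4 attempts of 30 s each, with waits of 1 s, 2 s, and 4 s in between.
print(worst_case_seconds(timeout_s=30, max_attempts=4, base_backoff_s=1))  # 127.0
```

If the caller's own deadline is shorter than this number, the retry policy cannot finish inside it on a bad day.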
Prompting a model to reason in steps before answering measurably improves results — and for an agent the asymmetry is brutal: a reasoning trace is cheap and reversible, but an executed action (a sent email, a dropped table, a charged card) is not. Letting the model lay out its plan in tokens before it commits is the cheapest insurance you can buy.
The takeaway
Force an explicit reasoning or plan step before any tool call with side effects. Burned tokens are far cheaper than a wrong action.
A single greedy chain of thought is fragile, but sampling several independent reasoning paths and taking the majority answer yields large, consistent gains. Correct reasoning tends to converge; mistakes scatter. Agreement across independently-generated plans is a real signal you can trust before acting on something consequential.
The takeaway
For high-stakes decisions, generate the plan or answer several times and act on the consensus — not on the first chain you happened to get.
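Acting on consensus can be sketched as a simple majority vote over repeated samples. Here `sample` stands in for whatever call produces one answer from the model (a hypothetical callable), and `min_share` is an assumed threshold below which the agent declines to act.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def majority_answer(sample: Callable[[], str], n: int = 5,
                    min_share: float = 0.6) -> Tuple[Optional[str], float]:
    """Sample n answers; return the majority answer and its vote share.

    Returns None as the answer when agreement is below min_share,
    which the caller can treat as "do not act, escalate instead".
    """
    answers = [sample() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    share = count / n
    return (answer if share >= min_share else None), share


# Usage with canned answers standing in for model samples.
canned = iter(["42", "42", "41", "42", "7"])
print(majority_answer(lambda: next(canned)))  # ('42', 0.6)
```

This only works when answers can be compared for equality, so free-form outputs usually need normalizing to a final value first.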
For decisions you can't take back, explore before you commit.
The principle
Tree-of-Thoughts generalizes linear reasoning into a search: generate several candidate thoughts, self-evaluate, look ahead, and backtrack instead of being trapped left-to-right. This matters most where an early decision is pivotal — exactly the situations where an agent's first irreversible action determines everything downstream. Cheap, recoverable steps don't need it; pivotal ones do.
The takeaway
When an early action is high-leverage or irreversible, have the agent generate and score several candidate plans before picking one — don't commit to the first path.
General methods plus compute beat your clever scaffolding.
The principle
The Bitter Lesson distills 70 years of AI: approaches that leverage general computation eventually crush approaches built on hand-encoded human cleverness, by a large margin. Baked-in scaffolds — elaborate prompt chains, rigid decision trees, hardcoded heuristics — buy a short-term gain and become a ceiling. Your intricate planning DSL will likely be obsoleted by the next, more capable model.
The takeaway
Prefer general, model-driven reasoning over bespoke hand-tuned logic. Build scaffolding you'd be happy to delete when the model improves.
Extra reasoning past the answer is wasted — or a wrong turn.
The principle
Reasoning models 'overthink': they pour disproportionate effort into trivial problems for minimal gain, and on harder ones, extended deliberation can talk them out of a correct initial answer. Reasoning depth has a sweet spot, not a monotonic payoff. An agent grinding tokens on a simple lookup burns latency and money; one that keeps re-deriving can reason its way to the wrong conclusion.
The takeaway
Match reasoning budget to problem difficulty. Cap thinking on easy steps, and stop once you have a confident answer instead of letting the model wander.
Your answer can only be as good as what you retrieved.
The principle
A model's parametric memory is fixed and imprecise; the retriever supplies the facts it reasons over. If the right passage never makes it into context, no amount of model intelligence recovers it — the generator confidently fills the gap instead. Retrieval quality is the hard ceiling on answer quality, not a tunable nice-to-have.
The takeaway
Measure and optimize retrieval (recall@k, hit rate) as a first-class metric before touching prompts or models. If recall is low, fix retrieval first — better generation cannot save you.
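The two metrics named above are small enough to compute by hand on a labeled set. This sketch assumes each query comes with a ranked list of retrieved document IDs and a set of known-relevant IDs; the function names and data shapes are illustrative.

```python
from typing import List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate(runs: List[Tuple[List[str], Set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(1 for retrieved, relevant in runs if set(retrieved[:k]) & relevant)
    return hits / len(runs)


print(recall_at_k(["a", "b", "c"], {"a", "d"}, k=2))                   # 0.5
print(hit_rate([(["a", "b"], {"a"}), (["x", "y"], {"z"})], k=2))       # 0.5
```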
Retrieval reduces hallucination; it does not eliminate it.
The principle
Vendors marketed RAG legal tools as 'hallucination-free', yet a Stanford audit found they still hallucinated 17–33% of the time. Handing the model a source doesn't force it to use that source faithfully: it can misread, over-generalize, or cite a real document for a claim the document never makes. Grounding lowers the error rate; it never drives it to zero.
The takeaway
Treat 'we use RAG' as risk reduction, not a safety claim. Verify that generated claims actually trace to the retrieved passage, and never advertise grounded systems as hallucination-proof.
Near-misses poison context worse than random noise.
The principle
Counterintuitively, documents that are topically related but don't answer the question are more harmful than clearly irrelevant ones — they look plausible and pull the generator toward wrong-but-adjacent answers. Stuffing more 'kind of relevant' chunks into context degrades accuracy rather than improving coverage. Precision at the top beats breadth.
The takeaway
Optimize for precision, not recall-at-any-cost. Aggressively rerank and filter out distractor chunks — a smaller, sharper context beats a padded one.
Pure semantic search quietly loses to a decades-old baseline.
The principle
Dense embedding retrievers dominate in-domain but frequently underperform BM25 once you leave the training distribution — exact-match terms, product codes, names, and rare jargon are where embeddings blur and lexical search shines. In-domain accuracy doesn't predict out-of-domain generalization. Combining the two is how strong systems cut retrieval failures dramatically.
The takeaway
Default to hybrid (semantic + keyword/BM25) search, not embeddings alone — especially for jargon, IDs, and out-of-domain queries. Add a reranker on top to compound the gains.
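The text does not say how to combine the two result lists. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the method the source has in mind. It takes ranked ID lists from each retriever and needs no score normalization; `k = 60` is the conventional default constant.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["d1", "d2", "d3"]   # hypothetical embedding-search results
keyword = ["d3", "d1", "d4"]    # hypothetical BM25 results
print(reciprocal_rank_fusion([semantic, keyword]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents that both retrievers rank highly rise to the top, and a reranker can then be applied to the fused list.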
Give the agent a hierarchy, not just a bigger prompt.
The principle
Treat the context window like a computer's RAM: an agent should actively page information between a small in-context working set and large external storage, deciding what to keep, evict, and recall. Cramming everything into one flat window conflates working memory with long-term storage and hits hard limits. Durable agent memory needs explicit tiers and self-managed retrieval.
The takeaway
Architect memory in tiers — working context, recallable summaries, external stores — with explicit policies for what gets promoted or evicted, rather than relying on context length.
A scoped agent with a handful of well-chosen tools outperforms a generalist drowning in options. Every extra tool is another way to choose wrong, another branch to test, another failure to debug. Capability surface is liability surface — breadth you don't need is just risk you took on.
The takeaway
Start narrow. Add a tool only when a real task demands it, not because it might be handy someday. When selection gets unreliable, the first move is usually fewer tools, not better instructions.
Validation, schema enforcement, retries, routing, and access control are not the model's job — they're code's job. The model is for judgment under ambiguity; deterministic code is for everything that must be correct every single time. Asking a probabilistic system to guarantee a contract is asking for the 0.1% that ruins you.
The takeaway
Wrap the model in code you can trust. Let it reason in the soft middle, but put a deterministic shell around the inputs and outputs so the hard guarantees never ride on a sampling roll.
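A deterministic shell can be as small as a strict parser between the model's raw output and anything that acts on it. The sketch below assumes the model is asked to return a JSON object; the action names and the `amount_cents` field are hypothetical.

```python
import json

ALLOWED_ACTIONS = {"refund", "escalate", "close"}  # hypothetical action set


def parse_decision(raw: str) -> dict:
    """Validate model output; raise ValueError instead of passing bad data on."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    amount = data.get("amount_cents", 0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount < 0:
        raise ValueError("amount_cents must be a non-negative integer")
    return {"action": action, "amount_cents": amount}


print(parse_decision('{"action": "refund", "amount_cents": 500}'))
# {'action': 'refund', 'amount_cents': 500}
```

Anything that fails validation never reaches a tool; the caller decides whether to retry, repair, or escalate.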
If you can't see what the agent did and why — every decision, tool call, and input — you can't safely let it act on its own. You're not trusting it; you're hoping. Autonomy without a trace is just an outage you haven't found yet, and when it breaks you'll have no way to learn why.
The takeaway
Build the trace before you grant the freedom. Make every step inspectable after the fact, then widen autonomy only as far as your visibility actually reaches.
When it's unreliable, split it — don't supersize it.
The principle
When output is inconsistent, the instinct is to throw more at the same shape: a bigger model, a longer context, more tokens. That rarely fixes a structural problem — it just dilutes attention further. Splitting the task into focused, single-purpose passes almost always beats making one overloaded pass smarter.
The takeaway
Break the work into stages that each do one thing well — analyze per-item, then reconcile across items. A focused pass beats a heroic pass trying to do everything at once.
When something misbehaves, the cheapest fix that addresses the root cause usually wins — and it's usually clearer instructions, a better tool description, or a concrete example, not a new classifier, preprocessing layer, or pipeline. Infrastructure feels like progress but often just wraps an unsolved prompt in more surface area.
The takeaway
Exhaust the prompt-level fixes before you build systems. Only add infrastructure once you've proven that words, examples, and scoping genuinely can't close the gap.
An agent is only as capable as its tools are legible.
The principle
The agent decides what to call based on how a tool reads, not on what it actually does. A vague description — 'searches the database' — gets passed over for a tool the model understands better, even a worse one. Thin tool descriptions cause more failures than thin instructions ever do.
The takeaway
Write tool descriptions like you're onboarding a sharp new engineer: what it does, when to use it (and when not to), what it expects, what it returns. The description is the interface the model actually reasons over.
If an instruction has produced the wrong result twice, writing it a third time — more precisely — rarely helps, because prose is always interpretable. Two or three concrete input/output examples eliminate the ambiguity that no amount of careful description can. Examples demonstrate the rule; prose only describes it.
The takeaway
When results are inconsistent, switch from describing to demonstrating. Show worked examples — especially the edge cases and the 'leave it blank' cases — and let the model generalize from them.
Models are routinely confident and wrong, and unconfident and right. Routing decisions on self-reported confidence inherits that miscalibration. 'Only flag high-confidence issues' or 'be conservative' just moves the noise around — it doesn't reduce it, because the confidence itself is the unreliable signal.
The takeaway
Replace confidence thresholds and vague hedges with explicit, categorical criteria: what specifically counts as in, what specifically counts as out, with an example of each. Specificity beats self-assessed certainty every time.
When the data is unclear, don't guess confidently.
The principle
Faced with two plausible matches, conflicting sources, or a missing field, an agent's instinct is to pick the 'most likely' option and move on — a confident choice that silently buries the doubt. When the stakes touch identity, money, or anything irreversible, a quiet wrong guess is far worse than an honest 'this is unclear'.
The takeaway
Make the agent escalate ambiguity instead of papering over it: ask for another identifier, preserve both conflicting values with their sources, flag the conflict for a human. Surface the doubt to whoever can actually resolve it.
An aggregate metric is a blended story that smooths over exactly the failures you most need to see. A system at 97% overall can be 99% on easy cases and 60% on the rare, hard segment where errors actually cluster. Trust the headline and you'll automate straight into the cracks it's hiding.
The takeaway
Slice before you trust. Break performance down by type, segment, and field, and require every slice to clear the bar before you act on the average. Sample deliberately for the rare cases, not just randomly.
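The gap between the headline and the slices is easy to reproduce. This sketch uses made-up evaluation records, chosen so the overall accuracy is 97%, and assumes each record is a `(slice_name, correct)` pair.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_slice(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per slice from (slice_name, correct) records."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_name, correct in records:
        totals[slice_name][0] += int(correct)
        totals[slice_name][1] += 1
    return {name: ok / n for name, (ok, n) in totals.items()}


# Illustrative data: 950 easy cases, 50 hard ones.
records = ([("easy", True)] * 940 + [("easy", False)] * 10
           + [("hard", True)] * 30 + [("hard", False)] * 20)

overall = sum(correct for _, correct in records) / len(records)
by_slice = accuracy_by_slice(records)
failing = [name for name, acc in by_slice.items() if acc < 0.9]  # assumed bar of 90%

print(overall)   # 0.97
print(failing)   # ['hard']
```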
Eyeballing outputs feels like progress until you can't tell if a change helped.
The principle
The common root cause of failed LLM products is the absence of robust evals: teams ship on vibe checks, iterate blindly, and can't measure whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software — the up-front cost that makes every later change cheap and safe.
The takeaway
Build a small eval harness before you scale. Turn every 'that looks wrong' moment into a permanent, re-runnable test case.
The highest-ROI activity in AI is the one teams skip first.
The principle
Error analysis — manually reading your app's actual traces to find where it fails — is the single most valuable activity in AI development, yet teams skip it for dashboards and vanity metrics that improve while users still struggle. You cannot write a good eval for a failure mode you've never seen, and you only see failure modes by reading transcripts.
The takeaway
Before buying an eval platform, hand-read 50–100 real traces and cluster the failures. Let those clusters define what you measure.
An LLM grader reacts to length and position, not just substance.
The principle
An LLM judge can match human preferences over 80% of the time — but only after accounting for systematic biases: position bias (favoring the first answer shown), verbosity bias (favoring longer answers regardless of quality), and self-enhancement bias (favoring its own outputs). It's a useful instrument, but an uncalibrated one that grades surface features as readily as substance.
The takeaway
Swap answer positions and average both orderings, control for length, and never let a model be the sole judge of its own family's output.
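Swapping positions and averaging can be sketched as below. It assumes a `judge(first, second)` callable that returns a score between 0 and 1 for how strongly the first-shown answer is preferred; that interface is hypothetical, and real judges often return labels that need mapping to numbers first.

```python
from typing import Callable


def debiased_preference(judge: Callable[[str, str], float], a: str, b: str) -> float:
    """Average preference for answer `a` over both presentation orders."""
    a_shown_first = judge(a, b)          # preference for a when a is first
    a_shown_second = 1.0 - judge(b, a)   # preference for a when b is first
    return (a_shown_first + a_shown_second) / 2


def first_position_fan(first: str, second: str) -> float:
    """A toy judge that always favors whatever it sees first."""
    return 0.7


# A purely position-biased judge nets out to no preference at all.
print(round(debiased_preference(first_position_fan, "answer A", "answer B"), 6))  # 0.5
```

This cancels position bias only; length control still has to be handled separately.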
When your eval becomes the goal, it stops measuring what you cared about.
The principle
When a measure becomes a target, it ceases to be a good measure. Optimize hard against any single metric and the agent learns to game its surface form — padding answers to please a verbosity-biased judge, or memorizing the eval set — while the underlying capability stagnates or regresses. The number goes up; the thing you cared about doesn't.
The takeaway
Keep a rotating, held-out eval the optimization loop never sees. Treat any metric you actively optimize as compromised, and re-validate against fresh data.
Every fixed bug is a future regression unless it becomes a test.
The principle
LLM systems are non-deterministic and globally coupled: a prompt tweak that fixes one case can silently break three others. Rerunning real production examples against the new prompt is the most dependable way to know you didn't break what already worked. Without a regression suite you're trapped in a whack-a-mole loop, re-discovering the same failures release after release.
The takeaway
Every failure you fix becomes a permanent case in your regression eval. Run the full suite on every prompt or model change before shipping.
Private data, untrusted content, and an exfiltration path — pick at most two.
The principle
An agent becomes exploitable the moment it combines three capabilities: access to private data, exposure to untrusted content, and the ability to communicate externally. Any single poisoned input in that pipeline can steer it into stealing your data — no code vulnerability required. Guardrails won't save you, because the model cannot reliably tell where an instruction came from.
The takeaway
Audit every agent for all three capabilities at once. If a workflow has all three, break the chain — remove a tool, isolate the data, or insert a human gate.
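The audit can start as a plain checklist in code. This sketch assumes you can describe each agent or workflow with three booleans; the class and field names are illustrative, and deciding what counts as "private" or "untrusted" remains a human judgment.

```python
from dataclasses import dataclass


@dataclass
class AgentCapabilities:
    reads_private_data: bool
    ingests_untrusted_content: bool
    can_send_externally: bool


def has_all_three(c: AgentCapabilities) -> bool:
    """True when the workflow combines all three risky capabilities."""
    return c.reads_private_data and c.ingests_untrusted_content and c.can_send_externally


inbox_agent = AgentCapabilities(True, True, True)    # reads mail, sees attacker text, can send
summarizer = AgentCapabilities(True, True, False)    # same inputs, no outbound channel

print(has_all_three(inbox_agent))  # True
print(has_all_three(summarizer))   # False
```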
The model can't tell your instructions from the attacker's — they're all just tokens.
The principle
Prompt injection is architectural, not a patchable bug: the model receives system prompts, user input, and ingested content as one undifferentiated token stream and will follow any instruction in it. Injection remains unsolved, and filtering has not proven reliable enough to depend on. Design as if every piece of untrusted content is an attacker speaking in your operator's voice.
The takeaway
Never rely on prompt-level guardrails or filters that try to catch 'ignore previous instructions'-style attacks. Assume untrusted content can issue commands, and constrain what the agent can do once it has ingested any.
An agent with your privileges will wield them on an attacker's behalf.
The principle
A confused deputy is a privileged program tricked by a caller into misusing its authority — not malicious, just confused about whose intent it's serving. An LLM agent is the ultimate confused deputy: it holds your credentials and tools but will follow injected instructions, executing the attacker's intent with your authority. Ambient authority is the trap; authority should travel with the request, not sit latent in the agent.
The takeaway
Scope every tool's authority to the specific task and caller. Avoid broad ambient credentials the agent can be tricked into abusing; prefer read-only by default.
Let the privileged planner orchestrate, but never let it read the poison.
The principle
The Dual-LLM pattern splits the agent in two: a privileged model that holds tools and plans actions but never sees untrusted content, and a quarantined model that processes tainted data but has no tools and returns only opaque variables. The privileged model orchestrates the quarantined one without ever ingesting the bytes that could carry an injection. Security comes from the separation.
The takeaway
Isolate the component that reads untrusted content from the component that can act. Pass references and structured results between them, never raw tainted text.
Assume the agent gets compromised — then contain what it can reach.
The principle
Defense in depth means planning for the injection that succeeds. Containing an agent with filesystem isolation (scoping access to specific directories) and network isolation (blocking exfiltration) means a compromised agent can't reach beyond its sandbox. Real incidents — CI agents that could leak secrets via untrusted content — show why the second layer matters when the first fails.
The takeaway
Run agent tool execution in an isolated environment with constrained filesystem and network access, so a successful injection is contained instead of catastrophic.
Agents buy flexibility with latency, cost, and unpredictability.
The principle
The simplest solution that works is usually the right one — and sometimes that means not building an agentic system at all. Agents that dynamically direct their own tool use trade latency, cost, and predictability for autonomy; a workflow with predefined code paths is cheaper and more reliable for well-defined tasks. Reach for an agent only when the problem genuinely needs model-driven decisions at runtime.
The takeaway
Default to a deterministic workflow. Promote to an agent only when the task's branching is too open-ended to script.
Try the cheap model first; only the hard cases deserve the expensive one.
The principle
Most queries don't need your most powerful model. Routing requests through a cascade — a cheap model first, escalating to stronger models only when confidence is low — can match top-tier quality at a fraction of the cost. The price gap between models spans two orders of magnitude, so paying top dollar for every call is pure waste.
The takeaway
Build a cascade: answer with the cheapest model that clears your eval bar, and escalate only on low-confidence or failed cases.
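A cascade reduces to a loop over models ordered by cost. In this sketch, `models` is a list of `(name, call)` pairs and `accept` is whatever check you trust to decide an answer is good enough; both are hypothetical stand-ins, and the toy lambdas below are not real model calls.

```python
from typing import Callable, List, Tuple


def cascade(prompt: str,
            models: List[Tuple[str, Callable[[str], str]]],
            accept: Callable[[str], bool]) -> Tuple[str, str]:
    """Try models cheapest-first; return (model_name, answer) for the first accepted answer.

    If no answer is accepted, the last (strongest) model's answer is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    for name, call in models:
        answer = call(prompt)
        if accept(answer):
            break
    return name, answer


models = [
    ("small", lambda p: ""),            # toy cheap model that comes up empty
    ("large", lambda p: "Paris"),       # toy expensive model
]
print(cascade("Capital of France?", models, accept=lambda a: bool(a.strip())))
# ('large', 'Paris')
```

The quality of the cascade depends almost entirely on `accept`; a check grounded in your eval criteria is safer than a raw self-reported confidence score.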
Every extra agent multiplies your token bill — make sure the task can pay it.
The principle
A multi-agent research system can burn roughly 15× the tokens of a single chat, and token usage alone can explain most of the performance variance. That means multi-agent only makes economic sense when the task's value is high and the work genuinely parallelizes. For most tightly-coupled work, the coordination overhead isn't worth it.
The takeaway
Reserve multi-agent architectures for high-value, heavily parallelizable tasks. For everything else the token tax outweighs the gains.
You ship a system shaped like your teams, so design the teams first.
The principle
Any system's structure ends up a copy of the communication structure of the organization that built it. Applied to AI: if three teams each own a model, you'll get three agents and a brittle seam between them — whether or not the problem wanted to be split that way. The agent boundaries you ship will trace your team boundaries unless you consciously fight it.
The takeaway
Before drawing agent or service boundaries, check whether they reflect the problem or just your org chart — and reorganize teams to match the architecture you actually want.
If an action can run twice, a retry will eventually run it twice.
The principle
Agents retry — on timeouts, rate limits, transient errors — but a failed call that never returned may have already succeeded server-side. Without an idempotency key, the retry that 'fixes' a network blip silently double-charges the card, double-sends the email, or double-books the room. Safe retries require the server to dedupe.
The takeaway
Attach a client-generated idempotency key to every side-effecting tool call so the server can deduplicate retries. Never let an agent blindly retry a non-idempotent action.
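The mechanics can be shown with an in-memory stand-in for the server. `PaymentServer` below is a hypothetical toy, not a real payment API; the point is that the key is generated once per logical action and reused on every retry.

```python
import uuid


class PaymentServer:
    """Toy server that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._seen = {}      # idempotency key -> stored receipt
        self.charges = 0     # how many real charges were made

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._seen:
            # A retry of a request we already processed: replay the result.
            return self._seen[idempotency_key]
        self.charges += 1
        receipt = {"charge_id": self.charges, "amount_cents": amount_cents}
        self._seen[idempotency_key] = receipt
        return receipt


server = PaymentServer()
key = str(uuid.uuid4())            # created once, before the first attempt
first = server.charge(key, 500)
retry = server.charge(key, 500)    # e.g. after a timeout on the first call

print(first == retry, server.charges)  # True 1
```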
A downstream model or tool that's timing out doesn't get healthier by being called more — it gets worse, while your agents pile up holding open connections and burning latency budget. A circuit breaker wraps the call so that once failures cross a threshold it trips: further calls fail fast instead of hanging, giving the dependency room to recover.
The takeaway
Wrap every external model and tool dependency in a circuit breaker that fails fast after a failure threshold, then probes for recovery — don't let a sick dependency drag the whole run down.
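A minimal circuit breaker fits in one small class. The sketch below assumes a count of consecutive failures as the trip condition and a fixed cooldown before one probe call is let through; the thresholds and names are illustrative, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; dependency is cooling down")
            # Cooldown elapsed: fall through and let one probe call run.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage is `breaker.call(fetch, url)` in place of `fetch(url)`, with one breaker instance per dependency so a single sick service does not trip calls to healthy ones.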
The more you automate, the harder the leftover human job becomes.
The principle
Automation doesn't shrink the human role — it transforms it into the hardest parts: passive monitoring plus rare, high-stakes intervention. Worse, by taking over the routine work, automation erodes the very skills and situational feel the operator needs when control is finally handed back. You design away the easy 95% and leave humans the 5% they're now least equipped to handle.
The takeaway
Don't just automate the happy path and dump edge cases on a human. Budget design effort for the residual role: keep the operator's context warm and make handback moments rare, clear, and well-supported.
People will trust the machine over their own eyes.
The principle
Given an automated aid, operators make errors of omission (missing problems it didn't flag) and commission (following its recommendation even when their own valid evidence contradicts it). Automation becomes a heuristic shortcut that replaces vigilant checking — so the agent's recommendation doesn't just inform the human, it overrides their independent judgment.
The takeaway
Never present an agent's output as the only signal. Force the human to confront the raw evidence alongside the recommendation, and make disagreement cheap.
Autonomy is a spectrum — from 'the computer suggests' to 'the computer acts then tells you' to 'the computer acts and decides whether to tell you at all'. The highest levels are unwise for consequential actions because no aid is perfectly reliable and the cost of a confident error is unbounded. Autonomy isn't one switch; it's a dial you set per action by how reversible and costly that action is.
The takeaway
Don't pick one autonomy level for the whole agent. Gate irreversible or high-impact actions to propose-and-confirm, while letting cheap, reversible ones run fully autonomous.
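Setting the dial per action can be expressed as a small policy function. This sketch assumes each action is tagged with whether it is reversible and an estimated cost; the two-level enum and the cost limit are simplifying assumptions, and real systems may want more levels.

```python
from enum import Enum


class Autonomy(Enum):
    AUTO = "run without asking"
    CONFIRM = "propose, then wait for approval"


def autonomy_for(reversible: bool, cost_usd: float,
                 cost_limit_usd: float = 50.0) -> Autonomy:
    """Pick an autonomy level for one action from its reversibility and cost."""
    if not reversible or cost_usd > cost_limit_usd:
        return Autonomy.CONFIRM
    return Autonomy.AUTO


print(autonomy_for(reversible=True, cost_usd=0.0).name)     # AUTO
print(autonomy_for(reversible=False, cost_usd=0.0).name)    # CONFIRM
print(autonomy_for(reversible=True, cost_usd=500.0).name)   # CONFIRM
```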
Most automation surprises start with 'what mode is it in?'
The principle
Flexible, multi-mode automation produces 'automation surprises' — the system does something unexpected because the operator lost track of which mode it was in, what it would do next, and why. As autonomy grows, the human's job shifts to tracking its state, and every hidden mode transition becomes a latent failure path. An agent that silently changes how it behaves leaves its supervisor one step from being wrong about it.
The takeaway
Make the agent's current mode, active constraints, and next intended action continuously visible — and never let it switch mode silently. Loud, legible state beats a clever agent the human can't predict.
In multi-agent systems, failures live in the seams.
The principle
Each agent can be flawless in isolation and the system still breaks — because the bug lives between them: what got passed, what got dropped, who owned the state. Sub-agents don't inherit context automatically; anything not explicitly handed over simply doesn't exist on the other side.
The takeaway
Design the contract at every boundary. Pass everything the next agent needs explicitly, make state ownership unambiguous, and validate what crosses the seam instead of assuming it survived.
People extend an agent freedom the way they extend it to a new hire — incrementally, on reversible things first, widening the leash only as it proves itself. Both failure modes are real: over-trust causes misuse, under-trust causes a good capability to be abandoned. Reliance tracks the perceived reliability the system reveals, not just its true reliability.
The takeaway
Start the agent on low-stakes, reversible actions and expand its blast radius as reliability is proven. Show why it's confident where it's strong and flag where it's weak, so users lean on it exactly where they should.
An agent with no legitimate way to say 'I'm stuck' or 'hand this to a human' will invent a path instead. Cornered without an exit — or forced to fill a required field it has no answer for — it fabricates something plausible rather than admitting the gap. A confident hallucination is the default when honesty isn't an option.
The takeaway
Always give the agent a first-class way out: an 'escalate to human' action, a nullable field, an explicit 'unknown'. Make 'I don't know' a valid, easy answer and you trade fabrications for honest gaps.
Without an external signal, a model largely fails to self-correct its own reasoning — and often makes correct answers worse by second-guessing them. The model that produced a flawed plan is the same one judging it, with the same blind spots. Real correction needs an outside signal: a tool result, a test that runs, a different model. 'Reflect and try again' on the same model with no new information is theater.
The takeaway
Separate generation from judgment. Use an independent instance — fresh context, no memory of the original reasoning — or an external check like a passing test, before trusting a 'corrected' answer.
When findings get summarized and re-summarized, the claim survives but its source, its date, and its uncertainty quietly fall away — until you're holding an assertion you can't verify or defend. Two sources disagreeing isn't noise to flatten; it's signal to keep. A fact without provenance is a rumor with good posture.
The takeaway
Carry attribution through every transformation: claim, source, date, confidence. Preserve conflicts with both sides attributed instead of silently picking a winner, so downstream can audit, weigh, and trust.
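Carrying attribution is mostly a data-shape decision: the claim never travels without its source. A minimal sketch, with hypothetical field names and made-up example values, that keeps both sides of a conflict instead of picking a winner:

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Claim:
    text: str
    source: str
    as_of: date
    confidence: str  # e.g. "high", "medium", "low"


def merge(field: str, claims: List[Claim]) -> dict:
    """Combine claims about one field, flagging disagreement rather than resolving it."""
    distinct_values = {c.text for c in claims}
    return {"field": field, "conflict": len(distinct_values) > 1, "claims": list(claims)}


# Illustrative values only.
a = Claim("120 employees", "company website", date(2024, 1, 15), "medium")
b = Claim("95 employees", "regulatory filing", date(2023, 6, 30), "high")

merged = merge("headcount", [a, b])
print(merged["conflict"], [c.source for c in merged["claims"]])
# True ['company website', 'regulatory filing']
```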
Further reading
The thinking these laws lean on — foundational essays, papers, and docs worth your time.