Law 35 · Safety & Security
Sandbox the Blast Radius
Assume the agent gets compromised — then contain what it can reach.

The principle
Defense in depth means planning for the injection that succeeds. Filesystem isolation (scoping access to specific directories) and network isolation (blocking exfiltration) keep a compromised agent from reaching beyond its sandbox. Real incidents, in which CI agents could be made to leak secrets via untrusted content, show why the second layer matters when the first fails.
Why it happens
Sandboxing is the layer you reach for precisely because prompt-injection prevention is not reliable: you assume the injection eventually succeeds and engineer so that success is contained rather than catastrophic. The two controls that matter are filesystem isolation (scoping the agent to a single working directory so it cannot read credentials or unrelated data) and network isolation (an egress allowlist so a compromised agent cannot POST stolen secrets to an attacker-controlled host). This is the classic defense-in-depth posture, and Google's 2025 agent-security framework frames it as deterministic guardrails enforced outside the model, wrapping the reasoning layer that can never be fully trusted. Real CI incidents make the case concrete: agents running untrusted PR branches with cloud credentials in environment variables and open egress have been steered into reading those secrets and exfiltrating them on the first attempt, which a directory-scoped container with a registry-only allowlist would have reduced to a harmless dead end.
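The two controls described above can be sketched as a pair of checks. This is a minimal illustration, not a complete sandbox: the working directory, the allowlisted hosts, and the function names are assumptions made for the example, and in a real deployment the same rules would be enforced outside the agent process (container mounts, network policy) rather than in code the agent could influence.

```python
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical values chosen for illustration only.
WORKDIR = Path("/workspace/task")
EGRESS_ALLOWLIST = {"pypi.org", "files.pythonhosted.org"}


def is_path_allowed(requested: str, workdir: Path = WORKDIR) -> bool:
    """Return True only if the resolved path stays inside workdir."""
    root = workdir.resolve()
    # Joining an absolute path discards root, so "/etc/passwd" is rejected too.
    target = (root / requested).resolve()
    return target == root or root in target.parents


def is_egress_allowed(url: str, allowlist: set[str] = EGRESS_ALLOWLIST) -> bool:
    """Return True only for https URLs whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowlist
```

Both checks deny by default: anything that does not resolve inside the working directory, or whose host is not an exact match on the list, is refused. Exact host matching is deliberate, so a lookalike such as `pypi.org.attacker.example` does not pass.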
Watch for
- Agent tool execution runs with the full host environment, including credentials in environment variables.
- The agent has unrestricted outbound network access rather than an allowlist of required destinations.
- A successful injection could read or write files well outside the task's intended working directory.
In practice
Your CI agent runs untrusted PR branches with the build runner's full environment: cloud credentials sitting in env vars, plus open egress to the internet. A contributor's PR adds a test that reads those secrets and POSTs them to the contributor's server, and the injection succeeds on the first try. Defense in depth assumes exactly this. Run agent tool execution in a container scoped to the one working directory, with an egress allowlist that blocks everything but the registries you need, so a successful compromise is a contained annoyance instead of a credential leak.
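One way to picture the containment step is a helper that builds the container invocation for each tool call. The sketch below assumes Docker as the runtime and only constructs the argument list; it does not run anything. The function name, the `/work` mount point, and the `agent-egress` network are hypothetical. Docker has no built-in egress allowlist, so the sketch assumes that network was created separately and routes traffic through a proxy or firewall that permits only the required registries.

```python
def sandboxed_command(
    workdir: str,
    image: str,
    tool_argv: list[str],
    network: str = "agent-egress",  # hypothetical pre-created, egress-filtered network
) -> list[str]:
    """Build a `docker run` argv that confines one tool call."""
    return [
        "docker", "run", "--rm",
        "--network", network,                   # no direct route to the open internet
        "--read-only",                          # root filesystem is not writable
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--volume", f"{workdir}:/work",         # the only host directory visible inside
        "--workdir", "/work",
        # No --env flags: host environment variables are not forwarded
        # unless passed explicitly, so runner credentials stay outside.
        image,
        *tool_argv,
    ]
```

For example, `sandboxed_command("/srv/ci/checkout", "python:3.12-slim", ["pytest", "-q"])` yields a command in which the PR's test can see only the checkout directory and has no inherited cloud credentials to read. The path and image here are placeholders.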
Apply it
- Run tool execution in an isolated environment scoped to a single working directory with no access to ambient secrets.
- Enforce an egress allowlist that blocks all outbound traffic except the specific destinations the task requires.
- Design assuming the injection succeeds, and verify that the worst reachable outcome is contained, not catastrophic.
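Where tool execution is launched as a subprocess, "no access to ambient secrets" can be approximated by passing an explicit, minimal environment instead of inheriting the parent's. The following is a rough sketch under that assumption; the helper name and the list of kept variables are illustrative, and this covers only environment variables, not credential files on disk, which the directory scoping has to handle.

```python
import os
from collections.abc import Iterable, Mapping
from typing import Optional

# Hypothetical allowlist: only variables the tool needs in order to run.
SAFE_ENV_KEYS = ("PATH", "LANG")


def minimal_env(
    source: Optional[Mapping[str, str]] = None,
    keep: Iterable[str] = SAFE_ENV_KEYS,
) -> dict[str, str]:
    """Copy only allowlisted variables; everything else is dropped."""
    env = os.environ if source is None else source
    return {key: env[key] for key in keep if key in env}
```

The result can be passed as the `env` argument of `subprocess.run`, together with `cwd` set to the task's working directory. Because the helper allowlists rather than blocklists, a new secret added to the runner later is excluded without any code change.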
The takeaway
Run agent tool execution in an isolated environment with constrained filesystem and network access, so a successful injection is contained instead of catastrophic.