On April 8, 2026, Anthropic published Scaling Managed Agents: Decoupling the brain from the hands. It’s the most detailed public description of production-grade Agent infrastructure to date.
This post isn’t a recap. We want to dissect the design decisions — why they made them, what the trade-offs are, and which parts are genuinely counterintuitive.
Starting Point: Four Failure Modes of the Monolith
Anthropic initially packed everything into a single container — session state, the agent orchestration loop (Harness), and the code execution sandbox all shared one environment.
The problems:
- Container dies, state dies with it. Session lived in-process with no independent persistence
- No security boundary. Claude’s untrusted generated code ran alongside credentials — a prompt injection just needed to convince Claude to read its own environment variables
- Undebuggable. Debugging required shelling into the container, but the container held user data, so engineers shouldn’t be touching it
- Deployment coupling. Connecting Claude to a customer’s VPC required network peering because the Harness assumed all resources were co-located
The Solution: Three Virtualized Components
Anthropic’s core analogy is the operating system: the OS virtualized hardware into abstractions like the process and the file. read() doesn’t care whether it’s hitting a 1970s disk pack or a modern SSD. Interfaces stay stable while implementations change freely.
Managed Agents does the same: virtualize the Agent into three independent components.
Session — Durable Event Log
- Append-only event log, the system’s single source of truth
- Lives outside Claude’s context window
- API: emitEvent(id, event) to write, getEvents() to read, with support for rewind, slice, and positional access
Session is a persistent event stream. Everything that happens — user messages, Claude outputs, tool call results — gets appended. Never deleted, never mutated.
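As a concrete sketch, that contract might look like the following in-memory stand-in. The class name, event shape, and slice options are assumptions for illustration; Anthropic has not published the actual API surface.

```typescript
type SessionEvent = { type: string; payload: unknown };

// In-memory stand-in for the durable store (the real one is persistent).
class SessionStore {
  private logs = new Map<string, SessionEvent[]>();

  // Write: append-only. Events are never mutated or deleted.
  emitEvent(id: string, event: SessionEvent): number {
    const log = this.logs.get(id) ?? [];
    log.push(event);
    this.logs.set(id, log);
    return log.length - 1; // position of the appended event
  }

  // Read: full history, a slice, or everything up to a rewind point.
  getEvents(id: string, opts: { from?: number; to?: number } = {}): SessionEvent[] {
    const log = this.logs.get(id) ?? [];
    return log.slice(opts.from ?? 0, opts.to ?? log.length); // copies; callers can't mutate the log
  }
}
```

The positional return value from emitEvent is what makes rewind and slice semantics cheap: any point in the log is addressable by index.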
Harness — Stateless Orchestrator
- Classic Agent Loop: call Claude API → parse tool calls → route to execution environment → write results back to Session → loop
- Fully stateless: recovers from crashes via wake(sessionId) + getSession(id), resuming from the last event
- Responsible for context engineering: pull events from Session → trim/transform → feed into Claude’s context window
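The loop and its crash recovery can be sketched as a pure function of the Session. Here callModel and execute are stand-ins for the Claude API and the Sandbox, and the event shape is an assumption:

```typescript
type HarnessEvent = { type: "user" | "model" | "tool_result"; text: string };

type Store = {
  getEvents(id: string): HarnessEvent[];
  emitEvent(id: string, e: HarnessEvent): void;
};

// The classic loop: call model → parse tool call → route to execution →
// write results back to the Session → repeat until there are no tool calls.
function wake(
  sessionId: string,
  store: Store,
  callModel: (context: HarnessEvent[]) => { toolCall?: string; text: string },
  execute: (call: string) => string,
): string {
  for (;;) {
    // Stateless: everything needed to resume comes back out of the Session.
    const context = store.getEvents(sessionId); // context engineering would trim here
    const out = callModel(context);
    store.emitEvent(sessionId, { type: "model", text: out.text });
    if (!out.toolCall) return out.text;         // no tool call: the turn is complete
    const result = execute(out.toolCall);       // route to the execution environment
    store.emitEvent(sessionId, { type: "tool_result", text: result });
  }
}
```

Because the Harness holds nothing between iterations, a crash at any point is recovered by calling wake again with the same sessionId.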
Sandbox — Disposable Execution Environment
- Container-isolated, provisioned on demand via provision({resources})
- Lazy init: only spins up when Claude actually needs to execute code
- Container dies → Harness treats it as a tool error → Claude decides whether to retry → new container rebuilt from a standard recipe
Two Genuinely Counterintuitive Designs
The three-layer split itself isn’t novel. What matters are the two specific decoupling decisions.
1. Session ≠ Context Window
Many Agent frameworks conflate “context management” with “state storage.” The typical approach: when context fills up, compress it (compaction). The compressed output serves as both storage and context — information lost is lost forever.
Anthropic explicitly separates the two:
- Session handles persistent storage only. Append-only, complete, uncompressed
- Context engineering is the Harness’s job — what to pull from Session, how to trim it, how to fit it into Claude’s context window. All implementation details of the Harness
From the original post:
We separated the concerns of recoverable context storage in the session and arbitrary context management in the harness because we can’t predict what specific context engineering will be required in future models.
This means context engineering strategies can evolve with model capabilities (their example: Sonnet 4.5 had “context anxiety” requiring resets; Opus 4.5 eliminated this behavior, making resets dead weight), while the raw data in Session is always there. You can always go back and re-process with a new strategy.
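The separation can be sketched as a swappable strategy function over an immutable log. The strategy names (keepAll, lastN) are invented for illustration:

```typescript
type Ev = { type: string; text: string };

const session: Ev[] = [];              // single source of truth: complete, never compacted
const append = (e: Ev) => session.push(e);

// Context engineering is just a function from the full log to a window.
type ContextStrategy = (events: Ev[]) => Ev[];

const keepAll: ContextStrategy = (evs) => evs;
const lastN = (n: number): ContextStrategy => (evs) => evs.slice(-n);

// Building a context window never mutates the log, so a new model generation
// can re-process the same session with a different strategy.
function buildContext(strategy: ContextStrategy): Ev[] {
  return strategy(session);
}
```

Swapping lastN for a summarizer, or for keepAll when a model no longer needs trimming, changes nothing about what was stored.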
2. Tool Execution Doesn’t Live Next to the Agent
The traditional mental model: an Agent is like opening a terminal — reasoning and execution share one environment. Anthropic physically separates “thinking” from “doing”:
Brain = Claude + Harness (reasoning, long-lived, stateless)
Hands = Sandbox + Tools (execution, on-demand, isolated)
Interface: execute(name, input) → string
One interface between Brain and Hands. The Harness doesn’t know whether the Sandbox is a container, a VPC, a phone, or — in the original post’s words — “a Pokémon emulator.” Anything implementing execute(name, input) → string is a valid Hand.
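In TypeScript terms, that contract might look like this. Both implementations are toys; the point is that the Harness is written against the interface, never a concrete environment:

```typescript
// The single seam between Brain and Hands.
interface Hand {
  execute(name: string, input: string): string;
}

// One Hand might be a local container...
const localHand: Hand = {
  execute: (name, input) => `local ${name}(${input})`,
};

// ...another a customer VPC, a phone, or an emulator. Same seam.
const remoteHand: Hand = {
  execute: (name, input) => `remote ${name}(${input})`,
};

// The Harness only ever routes through the interface.
function routeToolCall(hand: Hand, name: string, input: string): string {
  return hand.execute(name, input);
}
```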
Direct benefits:
- Security as an architectural property, not a policy. Code in the Sandbox physically cannot access any credential
- Git tokens: injected into local git remote URLs at init time. Claude uses git push/pull but never sees the token itself
- OAuth: tokens live in an external vault. Claude calls through an MCP proxy; the proxy uses session tokens to fetch real credentials from the vault
- Cost: many sessions never need an execution environment (pure conversation/reasoning). No sandbox = no cost
- Fault isolation: sandbox failure is just a tool error. Brain stays alive
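The OAuth path can be sketched as a proxy/vault pattern. Every name here is illustrative, but the property it demonstrates is the architectural one: sandbox code only ever holds an opaque session token, and the real credential is used on the other side of the boundary.

```typescript
// Lives entirely outside the sandbox (stand-in for an external vault).
const vault = new Map<string, string>([["sess-abc", "real-oauth-token"]]);

// The proxy is the only component that can read the vault. It exchanges the
// session token for the real credential and never returns the credential.
function proxyCall(sessionToken: string, request: string): string {
  const credential = vault.get(sessionToken);
  if (!credential) return "error: unknown session";
  return `fetched "${request}" using a ${credential.length}-char credential`;
}

// Runs inside the sandbox: sees only the opaque token.
function sandboxCode(): string {
  return proxyCall("sess-abc", "GET /issues");
}
```

A prompt injection that convinces Claude to dump its environment finds only the session token, which is useless outside this proxy.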
The Truth About TTFT Optimization
The paper reports TTFT (Time To First Token) p50 down ~60%, p95 down >90%.
The reason is mundane: inference no longer waits for container boot.
Old architecture: Session starts → boot container (clone repo + setup) → container ready → start reasoning
New architecture: Session starts → Harness pulls events from Session → start reasoning immediately → Claude says “I need to run code” → only then provision() a sandbox
The TTFT improvement isn’t magic. It’s just moving container startup off the critical path. Most sessions don’t need code execution on their first turn.
Sandbox Is Cattle, but Not Single-Use
Here’s an easy misread. The paper repeatedly says “cattle, not pets,” which sounds like “new container per tool call.” But look at the actual model:
Session starts
→ Harness starts (stateless, immediate reasoning)
→ Claude first needs to run code
→ provision() spins up Sandbox
→ Subsequent tool calls reuse the same Sandbox
→ Sandbox dies → provision() a new one → restore from repo
Session ends or times out
→ Sandbox destroyed
“Cattle” means “replaceable when it dies,” not “disposable after one use.” Multiple tool calls in a long-running task share the same sandbox’s filesystem state — otherwise coding workflows (read file → edit code → run tests → check results) simply wouldn’t work.
Many Brains, Many Hands
Decoupling unlocks topological freedom:
- Many Brains: stateless Harnesses scale horizontally. No more one-container-per-session
- Many Hands: one Brain operates multiple execution environments — your VPC, my container, their database
- Brains pass Hands to each other: since Hands aren’t coupled to any specific Brain
That last point hints at multi-agent collaboration — agent A does half the work in a sandbox, then hands the sandbox handle to agent B to continue.
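A handoff might be as simple as passing the handle, since a Hand is just some implementation of the execute interface. All names here are illustrative:

```typescript
type HandHandle = { execute(name: string, input: string): string };

// A shared sandbox whose state outlives any one agent's involvement.
function makeSharedSandbox(): HandHandle {
  const fs = new Map<string, string>();
  return {
    execute(name, input) {
      if (name === "write") {
        const [file, data] = input.split("=");
        fs.set(file, data);
        return "ok";
      }
      return fs.get(input) ?? "missing"; // any other name reads a file
    },
  };
}

// Agent A does half the work, then hands the same handle to agent B.
function agentA(hand: HandHandle) { hand.execute("write", "progress.md=step 1 done"); }
function agentB(hand: HandHandle): string { return hand.execute("read", "progress.md"); }
```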
Same Problem, Different Cuts: GCP Agent Engine
Anthropic isn’t the only one solving this. Google Cloud’s Vertex AI Agent Engine also offers production-grade Agent hosting, but with a fundamentally different design philosophy.
GCP’s approach is PaaS: we manage Session and Memory for you, you focus on Agent logic.
- VertexAiSessionService — managed session persistence
- VertexAiMemoryBankService — managed long-term memory
- Agent code runs on Cloud Run or Agent Engine’s managed runtime
- Open ecosystem: Agent2Agent (A2A) protocol + MCP integration
Anthropic’s approach is OS: we define abstraction layers and interfaces, concrete implementations are swappable.
The key difference is execution isolation. GCP Agent Engine solves state persistence (sessions don’t die with the process), but Agent code and tool execution still share the same Cloud Run service — there’s no Brain-Hands separation. The execute(name, input) → string interface that physically separates reasoning from execution has no counterpart on the GCP side.
From another angle, this reflects the two companies’ different positions:
- Google sells cloud infrastructure. Agent Engine is an on-ramp to more GCP services. Openness (A2A, MCP) is a competitive strategy
- Anthropic sells model capability. Managed Agents is the infrastructure that lets Claude run in more complex scenarios. Closed but deeply optimized
The two approaches aren’t mutually exclusive — they’re potentially complementary. Running Anthropic’s Brain-Hands topology on GCP’s infrastructure is theoretically viable.
What They Didn’t Say
The paper has several notable gaps:
- No discussion of tool call latency. Every execute() involves a network roundtrip. What’s the impact on high-frequency tool call patterns (e.g., Claude Code’s rapid-fire file reads and edits)? Is there a warm pool or co-location optimization?
- Customer cases exist but scenario differences aren’t explored. Notion, Rakuten, Asana, and Sentry are already running Managed Agents in production (Notion’s use case: Claude picks up tasks from a team board and executes within the workspace). The paper also mentions both Claude Code and task-specific harnesses can run on this architecture. But different scenarios clearly need different Brain-Hands topologies — does a high-frequency tool-calling coding agent use the same setup as a low-frequency async task agent? This isn’t addressed
- Session storage implementation. What backs the append-only event log? What’s the retention policy? What’s the rewind granularity?
- Specific context engineering strategies. The paper says the Harness handles context engineering but doesn’t elaborate — full injection? Sliding window? Summarization? Only one example is given (context anxiety/reset)
These gaps are reasonable — this is an architecture philosophy paper, not an implementation doc. But if you’re building your own system on these ideas, these blanks need filling.
Conclusion
Managed Agents’ core contribution isn’t “Agents should be layered” — everyone knows that. Its contribution is making explicit where to cut:
- Storage (Session) and context engineering (Harness) must be separate. Your context engineering strategy will change with model iterations, but raw data shouldn’t be lost
- Reasoning (Brain) and execution (Hands) must be physically isolated. Not for architectural elegance — because security boundaries are only reliable when physically enforced
- Interfaces should outlast implementations. execute(name, input) → string is simple enough and general enough to survive whatever comes next
Quoting the original post’s closing:
The challenge we faced is an old one: how to design a system for “programs as yet unthought of.”
Applied to Agent infrastructure: don’t design for what models can do today. Design for the day you don’t yet know what they’ll be capable of.
Original post: Scaling Managed Agents: Decoupling the brain from the hands