On April 8, 2026, Anthropic published Scaling Managed Agents: Decoupling the brain from the hands. It’s the most detailed public description of production-grade Agent infrastructure to date.
This post isn’t a recap. We want to dissect the design decisions — why they made them, what the trade-offs are, and which parts are genuinely counterintuitive.
Starting Point: Four Failure Modes of the Monolith
Anthropic initially packed everything into a single container — session state, the agent orchestration loop (Harness), and the code execution sandbox all shared one environment.
The problems:
- Container dies, state dies with it. Session lived in-process with no independent persistence
- No security boundary. Claude’s untrusted generated code ran alongside credentials — a prompt injection just needed to convince Claude to read its own environment variables
- Undebuggable. Debugging required shelling into the container, but the container held user data, so engineers shouldn’t be touching it
- Deployment coupling. Connecting Claude to a customer’s VPC required network peering because the Harness assumed all resources were co-located
The Solution: Three Virtualized Components
Anthropic’s core analogy is the operating system: the OS virtualized hardware into abstractions like the process and the file. read() doesn’t care whether it’s hitting a 1970s disk pack or a modern SSD. Interfaces stay stable while implementations change freely.
Managed Agents does the same: virtualize the Agent into three independent components.
Session — Durable Event Log
- Append-only event log, the system’s single source of truth
- Lives outside Claude’s context window
- API: emitEvent(id, event) to write, getEvents() to read, with support for rewind, slice, and positional access
Session is a persistent event stream. Everything that happens — user messages, Claude outputs, tool call results — gets appended. Never deleted, never mutated.
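As a concrete sketch, that contract might look like the following in-memory stand-in. The class name, event shape, and slice options are assumptions for illustration; Anthropic has not published the actual API surface.

```typescript
type SessionEvent = { type: string; payload: unknown };

// In-memory stand-in for the durable store (the real one is persistent).
class SessionStore {
  private logs = new Map<string, SessionEvent[]>();

  // Write: append-only. Events are never mutated or deleted.
  emitEvent(id: string, event: SessionEvent): number {
    const log = this.logs.get(id) ?? [];
    log.push(event);
    this.logs.set(id, log);
    return log.length - 1; // position of the appended event
  }

  // Read: full history, a slice, or everything up to a rewind point.
  getEvents(id: string, opts: { from?: number; to?: number } = {}): SessionEvent[] {
    const log = this.logs.get(id) ?? [];
    return log.slice(opts.from ?? 0, opts.to ?? log.length); // copies; callers can't mutate the log
  }
}
```

The positional return value from emitEvent is what makes rewind and slice semantics cheap: any point in the log is addressable by index.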
Harness — Stateless Orchestrator
- Classic Agent Loop: call Claude API → parse tool calls → route to execution environment → write results back to Session → loop
- Fully stateless: recovers from crashes via wake(sessionId) + getSession(id), resuming from the last event
- Responsible for context engineering: pull events from Session → trim/transform → feed into Claude’s context window
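The loop and its crash recovery can be sketched as a pure function of the Session. Here callModel and execute are stand-ins for the Claude API and the Sandbox, and the event shape is an assumption:

```typescript
type HarnessEvent = { type: "user" | "model" | "tool_result"; text: string };

type Store = {
  getEvents(id: string): HarnessEvent[];
  emitEvent(id: string, e: HarnessEvent): void;
};

// The classic loop: call model → parse tool call → route to execution →
// write results back to the Session → repeat until there are no tool calls.
function wake(
  sessionId: string,
  store: Store,
  callModel: (context: HarnessEvent[]) => { toolCall?: string; text: string },
  execute: (call: string) => string,
): string {
  for (;;) {
    // Stateless: everything needed to resume comes back out of the Session.
    const context = store.getEvents(sessionId); // context engineering would trim here
    const out = callModel(context);
    store.emitEvent(sessionId, { type: "model", text: out.text });
    if (!out.toolCall) return out.text;         // no tool call: the turn is complete
    const result = execute(out.toolCall);       // route to the execution environment
    store.emitEvent(sessionId, { type: "tool_result", text: result });
  }
}
```

Because the Harness holds nothing between iterations, a crash at any point is recovered by calling wake again with the same sessionId.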
Sandbox — Disposable Execution Environment
- Container-isolated, provisioned on demand via provision({resources})
- Lazy init: only spins up when Claude actually needs to execute code
- Container dies → Harness treats it as a tool error → Claude decides whether to retry → new container rebuilt from a standard recipe
Two Genuinely Counterintuitive Designs
The three-layer split itself isn’t novel. What matters are the two specific decoupling decisions.
1. Session ≠ Context Window
Many Agent frameworks conflate “context management” with “state storage.” The typical approach: when context fills up, compress it (compaction). The compressed output serves as both storage and context — information lost is lost forever.
Anthropic explicitly separates the two:
- Session handles persistent storage only. Append-only, complete, uncompressed
- Context engineering is the Harness’s job — what to pull from Session, how to trim it, how to fit it into Claude’s context window. All implementation details of the Harness
From the original post:
We separated the concerns of recoverable context storage in the session and arbitrary context management in the harness because we can’t predict what specific context engineering will be required in future models.
This means context engineering strategies can evolve with model capabilities (their example: Sonnet 4.5 had “context anxiety” requiring resets; Opus 4.5 eliminated this behavior, making resets dead weight), while the raw data in Session is always there. You can always go back and re-process with a new strategy.
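The separation can be sketched as a swappable strategy function over an immutable log. The strategy names (keepAll, lastN) are invented for illustration:

```typescript
type Ev = { type: string; text: string };

const session: Ev[] = [];              // single source of truth: complete, never compacted
const append = (e: Ev) => session.push(e);

// Context engineering is just a function from the full log to a window.
type ContextStrategy = (events: Ev[]) => Ev[];

const keepAll: ContextStrategy = (evs) => evs;
const lastN = (n: number): ContextStrategy => (evs) => evs.slice(-n);

// Building a context window never mutates the log, so a new model generation
// can re-process the same session with a different strategy.
function buildContext(strategy: ContextStrategy): Ev[] {
  return strategy(session);
}
```

Swapping lastN for a summarizer, or for keepAll when a model no longer needs trimming, changes nothing about what was stored.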
2. Tool Execution Doesn’t Live Next to the Agent
The traditional mental model: an Agent is like opening a terminal — reasoning and execution share one environment. Anthropic physically separates “thinking” from “doing”:
Brain = Claude + Harness (reasoning, long-lived, stateless)
Hands = Sandbox + Tools (execution, on-demand, isolated)
Interface: execute(name, input) → string
One interface between Brain and Hands. The Harness doesn’t know whether the Sandbox is a container, a VPC, a phone, or — in the original post’s words — “a Pokémon emulator.” Anything implementing execute(name, input) → string is a valid Hand.
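In TypeScript terms, that contract might look like this. Both implementations are toys; the point is that the Harness is written against the interface, never a concrete environment:

```typescript
// The single seam between Brain and Hands.
interface Hand {
  execute(name: string, input: string): string;
}

// One Hand might be a local container...
const localHand: Hand = {
  execute: (name, input) => `local ${name}(${input})`,
};

// ...another a customer VPC, a phone, or an emulator. Same seam.
const remoteHand: Hand = {
  execute: (name, input) => `remote ${name}(${input})`,
};

// The Harness only ever routes through the interface.
function routeToolCall(hand: Hand, name: string, input: string): string {
  return hand.execute(name, input);
}
```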
Direct benefits:
- Security as an architectural property, not a policy. Code in the Sandbox physically cannot access any credential
- Git tokens: injected into local git remote URLs at init time. Claude uses git push/pull but never sees the token itself
- OAuth: tokens live in an external vault. Claude calls through an MCP proxy; the proxy uses session tokens to fetch real credentials from the vault
- Cost: many sessions never need an execution environment (pure conversation/reasoning). No sandbox = no cost
- Fault isolation: sandbox failure is just a tool error. Brain stays alive
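The OAuth path can be sketched as a proxy/vault pattern. Every name here is illustrative, but the property it demonstrates is the architectural one: sandbox code only ever holds an opaque session token, and the real credential is used on the other side of the boundary.

```typescript
// Lives entirely outside the sandbox (stand-in for an external vault).
const vault = new Map<string, string>([["sess-abc", "real-oauth-token"]]);

// The proxy is the only component that can read the vault. It exchanges the
// session token for the real credential and never returns the credential.
function proxyCall(sessionToken: string, request: string): string {
  const credential = vault.get(sessionToken);
  if (!credential) return "error: unknown session";
  return `fetched "${request}" using a ${credential.length}-char credential`;
}

// Runs inside the sandbox: sees only the opaque token.
function sandboxCode(): string {
  return proxyCall("sess-abc", "GET /issues");
}
```

A prompt injection that convinces Claude to dump its environment finds only the session token, which is useless outside this proxy.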
The Truth About TTFT Optimization
The paper reports TTFT (Time To First Token) p50 down ~60%, p95 down >90%.
The reason is mundane: inference no longer waits for container boot.
Old architecture: Session starts → boot container (clone repo + setup) → container ready → start reasoning
New architecture: Session starts → Harness pulls events from Session → start reasoning immediately → Claude says “I need to run code” → only then provision() a sandbox
The TTFT improvement isn’t magic. It’s just moving container startup off the critical path. Most sessions don’t need code execution on their first turn.
Sandbox Is Cattle, but Not Single-Use
Here’s an easy misread. The paper repeatedly says “cattle, not pets,” which sounds like “new container per tool call.” But look at the actual model:
Session starts
→ Harness starts (stateless, immediate reasoning)
→ Claude first needs to run code
→ provision() spins up Sandbox
→ Subsequent tool calls reuse the same Sandbox
→ Sandbox dies → provision() a new one → restore from repo
Session ends or times out
→ Sandbox destroyed
“Cattle” means “replaceable when it dies,” not “disposable after one use.” Multiple tool calls in a long-running task share the same sandbox’s filesystem state — otherwise coding workflows (read file → edit code → run tests → check results) simply wouldn’t work.
Many Brains, Many Hands
Decoupling unlocks topological freedom:
- Many Brains: stateless Harnesses scale horizontally. No more one-container-per-session
- Many Hands: one Brain operates multiple execution environments — your VPC, my container, their database
- Brains pass Hands to each other: since Hands aren’t coupled to any specific Brain
That last point hints at multi-agent collaboration — agent A does half the work in a sandbox, then hands the sandbox handle to agent B to continue.
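A handoff might be as simple as passing the handle, since a Hand is just some implementation of the execute interface. All names here are illustrative:

```typescript
type HandHandle = { execute(name: string, input: string): string };

// A shared sandbox whose state outlives any one agent's involvement.
function makeSharedSandbox(): HandHandle {
  const fs = new Map<string, string>();
  return {
    execute(name, input) {
      if (name === "write") {
        const [file, data] = input.split("=");
        fs.set(file, data);
        return "ok";
      }
      return fs.get(input) ?? "missing"; // any other name reads a file
    },
  };
}

// Agent A does half the work, then hands the same handle to agent B.
function agentA(hand: HandHandle) { hand.execute("write", "progress.md=step 1 done"); }
function agentB(hand: HandHandle): string { return hand.execute("read", "progress.md"); }
```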
Same Problem, Different Cuts: GCP Agent Engine
Anthropic isn’t the only one solving this. Google Cloud’s Vertex AI Agent Engine also offers production-grade Agent hosting, but with a fundamentally different design philosophy.
GCP’s approach is PaaS: we manage Session and Memory for you, you focus on Agent logic.
- VertexAiSessionService — managed session persistence
- VertexAiMemoryBankService — managed long-term memory
- Agent code runs on Cloud Run or Agent Engine’s managed runtime
- Open ecosystem: Agent2Agent (A2A) protocol + MCP integration
Anthropic’s approach is OS: we define abstraction layers and interfaces, concrete implementations are swappable.
The key difference is execution isolation. GCP Agent Engine solves state persistence (sessions don’t die with the process), but Agent code and tool execution still share the same Cloud Run service — there’s no Brain-Hands separation. The execute(name, input) → string interface that physically separates reasoning from execution has no counterpart on the GCP side.
From another angle, this reflects the two companies’ different positions:
- Google sells cloud infrastructure. Agent Engine is an on-ramp to more GCP services. Openness (A2A, MCP) is a competitive strategy
- Anthropic sells model capability. Managed Agents is the infrastructure that lets Claude run in more complex scenarios. Closed but deeply optimized
The two approaches aren’t mutually exclusive — they’re potentially complementary. Running Anthropic’s Brain-Hands topology on GCP’s infrastructure is theoretically viable.
What They Didn’t Say
The paper has several notable gaps:
- No discussion of tool call latency. Every execute() involves a network roundtrip. What’s the impact on high-frequency tool call patterns (e.g., Claude Code’s rapid-fire file reads and edits)? Is there a warm pool or co-location optimization?
- Customer cases exist but scenario differences aren’t explored. Notion, Rakuten, Asana, and Sentry are already running Managed Agents in production (Notion’s use case: Claude picks up tasks from a team board and executes within the workspace). The paper also mentions both Claude Code and task-specific harnesses can run on this architecture. But different scenarios clearly need different Brain-Hands topologies — does a high-frequency tool-calling coding agent use the same setup as a low-frequency async task agent? This isn’t addressed
- Session storage implementation. What backs the append-only event log? What’s the retention policy? What’s the rewind granularity?
- Specific context engineering strategies. The paper says the Harness handles context engineering but doesn’t elaborate — full injection? Sliding window? Summarization? Only one example is given (context anxiety/reset)
These gaps are reasonable — this is an architecture philosophy paper, not an implementation doc. But if you’re building your own system on these ideas, these blanks need filling.
Conclusion
Managed Agents’ core contribution isn’t “Agents should be layered” — everyone knows that. Its contribution is making explicit where to cut:
- Storage (Session) and context engineering (Harness) must be separate. Your context engineering strategy will change with model iterations, but raw data shouldn’t be lost
- Reasoning (Brain) and execution (Hands) must be physically isolated. Not for architectural elegance — because security boundaries are only reliable when physically enforced
- Interfaces should outlast implementations. execute(name, input) → string is simple enough and general enough to survive whatever comes next
Quoting the original post’s closing:
The challenge we faced is an old one: how to design a system for “programs as yet unthought of.”
Applied to Agent infrastructure: don’t design for what models can do today. Design for the day you don’t yet know what they’ll be capable of.
Original post: Scaling Managed Agents: Decoupling the brain from the hands