
Brain ≠ Hands: Dissecting Anthropic's Managed Agents Architecture

MasakiMu319

On April 8, 2026, Anthropic published Scaling Managed Agents: Decoupling the brain from the hands. It’s the most detailed public description of production-grade Agent infrastructure to date.

This post isn’t a recap. We want to dissect the design decisions — why they made them, what the trade-offs are, and which parts are genuinely counterintuitive.


Starting Point: Four Failure Modes of the Monolith

Anthropic initially packed everything into a single container — session state, the agent orchestration loop (Harness), and the code execution sandbox all shared one environment.

The problems:

  1. Container dies, state dies with it. Session lived in-process with no independent persistence
  2. No security boundary. Claude’s untrusted generated code ran alongside credentials — a prompt injection just needed to convince Claude to read its own environment variables
  3. Undebuggable. Debugging required shelling into the container, but the container held user data, so engineers shouldn’t be touching it
  4. Deployment coupling. Connecting Claude to a customer’s VPC required network peering because the Harness assumed all resources were co-located

The Solution: Three Virtualized Components

Anthropic’s core analogy is the operating system: the OS virtualizes hardware into abstractions like processes and files. read() doesn’t care whether it’s hitting a 1970s disk pack or a modern SSD. Interfaces stay stable while implementations change freely.

Managed Agents does the same: virtualize the Agent into three independent components.

Session — Durable Event Log

Session is a persistent event stream. Everything that happens — user messages, Claude outputs, tool call results — gets appended. Never deleted, never mutated.
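The append-only model can be sketched as a minimal event store. This is illustrative only; the post doesn't describe Anthropic's actual storage implementation:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Session:
    """A durable event log: events are appended, never deleted or mutated."""
    events: list = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> None:
        # Every user message, Claude output, and tool result becomes one event.
        self.events.append({"ts": time.time(), "kind": kind, "payload": payload})

    def replay(self):
        # Any consumer (a Harness, a debugger) can rebuild state by replaying
        # the log from the beginning, in order.
        return iter(self.events)


s = Session()
s.append("user_message", {"text": "run the tests"})
s.append("tool_result", {"tool": "bash", "output": "4 passed"})
assert [e["kind"] for e in s.replay()] == ["user_message", "tool_result"]
```

Because nothing is ever rewritten in place, a crash loses at most the event being written, and any later process can reconstruct the full session history.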

Harness — Stateless Orchestrator

The Harness runs the agent loop: it pulls events from the Session, assembles Claude’s context, and dispatches tool calls. It holds no durable state of its own, so any instance can be restarted or replaced mid-session.

Sandbox — Disposable Execution Environment

The Sandbox is where untrusted generated code actually runs: provisioned on demand, isolated from credentials, and rebuilt from the repo if it dies.


Two Genuinely Counterintuitive Designs

The three-layer split itself isn’t novel. What matters are the two specific decoupling decisions.

1. Session ≠ Context Window

Many Agent frameworks conflate “context management” with “state storage.” The typical approach: when context fills up, compress it (compaction). The compressed output then serves as both storage and context, so anything dropped during compaction is gone for good.

Anthropic explicitly separates the two:

From the original post:

We separated the concerns of recoverable context storage in the session and arbitrary context management in the harness because we can’t predict what specific context engineering will be required in future models.

This means context engineering strategies can evolve with model capabilities (their example: Sonnet 4.5 had “context anxiety” requiring resets; Opus 4.5 eliminated this behavior, making resets dead weight), while the raw data in Session is always there. You can always go back and re-process with a new strategy.
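A hedged sketch of what this separation buys: context assembly becomes a pure, swappable function over the durable log. The strategy names here (full_history, last_n) are hypothetical, not Anthropic's:

```python
# Context engineering as a pure function over the raw event log.
# The Session keeps everything; the strategy decides what Claude sees.

def full_history(events: list) -> list:
    """Strategy A: inject the entire log (fine for capable models)."""
    return list(events)


def last_n(n: int):
    """Strategy B: a sliding window of the most recent n events."""
    def strategy(events: list) -> list:
        return events[-n:]
    return strategy


events = [{"role": "user", "text": f"msg {i}"} for i in range(10)]

# Because the raw events persist independently of any strategy,
# the same log can be re-processed under a new strategy when a
# new model's behavior changes what context engineering is needed.
assert len(full_history(events)) == 10
assert len(last_n(3)(events)) == 3
```

Swapping strategies touches only the Harness; the Session's data is never at risk, which is exactly the property the Sonnet 4.5 → Opus 4.5 example illustrates.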

2. Tool Execution Doesn’t Live Next to the Agent

The traditional mental model: an Agent is like opening a terminal — reasoning and execution share one environment. Anthropic physically separates “thinking” from “doing”:

Brain = Claude + Harness (reasoning, long-lived, stateless)
Hands = Sandbox + Tools (execution, on-demand, isolated)

Interface: execute(name, input) → string

One interface between Brain and Hands. The Harness doesn’t know whether the Sandbox is a container, a VPC, a phone, or — in the original post’s words — “a Pokémon emulator.” Anything implementing execute(name, input) → string is a valid Hand.
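A minimal sketch of that single interface, assuming a Python Protocol for structural typing; both implementations below are invented for illustration:

```python
from typing import Callable, Protocol


class Hand(Protocol):
    """The one Brain↔Hands interface: execute(name, input) -> string."""
    def execute(self, name: str, input: str) -> str: ...


class ToolSandbox:
    # One possible Hand: a table of local tool functions.
    def __init__(self, tools: dict[str, Callable[[str], str]]):
        self.tools = tools

    def execute(self, name: str, input: str) -> str:
        tool = self.tools.get(name)
        return tool(input) if tool else f"unknown tool: {name}"


class EmulatorHand:
    # Another valid Hand: anything that returns a string qualifies,
    # even (per the post's joke) a Pokémon emulator.
    def execute(self, name: str, input: str) -> str:
        return f"{name}({input}) -> ok"


def harness_step(hand: Hand, name: str, arg: str) -> str:
    # The Harness sees only the interface, never the implementation.
    return hand.execute(name, arg)


assert harness_step(ToolSandbox({"upper": str.upper}), "upper", "hi") == "HI"
assert harness_step(EmulatorHand(), "press_button", "A") == "press_button(A) -> ok"
```

The Harness code is identical for both Hands; that symmetry is the whole point of the interface.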

Direct benefits: the Harness no longer needs network peering into a customer’s VPC, Claude’s untrusted code never runs next to credentials, and either side can be swapped or scaled without touching the other.


The Truth About TTFT Optimization

The post reports TTFT (Time To First Token) down ~60% at p50 and more than 90% at p95.

The reason is mundane: inference no longer waits for container boot.

Old architecture: Session starts → boot container (clone repo + setup) → container ready → start reasoning

New architecture: Session starts → Harness pulls events from Session → start reasoning immediately → Claude says “I need to run code” → only then provision() a sandbox

The TTFT improvement isn’t magic. It’s just moving container startup off the critical path. Most sessions don’t need code execution on their first turn.
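The reordering can be sketched as lazy provisioning: provision() runs only when the first tool call arrives, so reasoning-only turns never pay the boot cost. The wrapper class and boot accounting here are illustrative:

```python
class LazySandbox:
    """Defers sandbox provisioning until the first tool call."""

    def __init__(self, provision):
        self._provision = provision   # expensive: clone repo + setup
        self._sandbox = None

    def execute(self, name: str, input: str) -> str:
        if self._sandbox is None:
            # The boot cost is paid here, off the reasoning critical path.
            self._sandbox = self._provision()
        return self._sandbox[name](input)


boots = []

def provision():
    boots.append("boot")
    return {"run": lambda arg: f"ran {arg}"}


hand = LazySandbox(provision)
# Reasoning can start immediately; no container has booted yet:
assert boots == []
assert hand.execute("run", "tests") == "ran tests"
assert hand.execute("run", "again") == "ran again"
assert boots == ["boot"]   # provisioned exactly once, on demand
```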


Sandbox Is Cattle, but Not Single-Use

Here’s an easy misread. The post repeatedly says “cattle, not pets,” which sounds like “new container per tool call.” But look at the actual model:

Session starts
  → Harness starts (stateless, immediate reasoning)
  → Claude first needs to run code
    → provision() spins up Sandbox
  → Subsequent tool calls reuse the same Sandbox
  → Sandbox dies → provision() a new one → restore from repo
Session ends or times out
  → Sandbox destroyed

“Cattle” means “replaceable when it dies,” not “disposable after one use.” Multiple tool calls in a long-running task share the same sandbox’s filesystem state — otherwise coding workflows (read file → edit code → run tests → check results) simply wouldn’t work.
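A sketch of that lifecycle, assuming a hypothetical ensure_sandbox() helper and restore_from_repo() step (the names are mine, not the post's):

```python
class Sandbox:
    """Replaceable execution environment with mutable filesystem state."""

    def __init__(self):
        self.alive = True
        self.files = {}          # shared across tool calls within a session

    def restore_from_repo(self, repo: dict):
        # Rebuild baseline state from a durable source, not from the
        # dead sandbox: cattle are replaced, never repaired in place.
        self.files = dict(repo)


def ensure_sandbox(current, repo):
    # Replace-on-death, not per-call disposal.
    if current is None or not current.alive:
        current = Sandbox()
        current.restore_from_repo(repo)
    return current


repo = {"main.py": "print('hi')"}
sb = ensure_sandbox(None, repo)
sb.files["notes.txt"] = "scratch"    # later tool calls see earlier edits
assert "notes.txt" in sb.files

sb.alive = False                     # sandbox dies mid-session
sb = ensure_sandbox(sb, repo)
assert sb.files == {"main.py": "print('hi')"}   # baseline restored
```

Note what is and isn't preserved: the repo baseline survives replacement, while uncommitted scratch state dies with the sandbox, which is precisely the "replaceable when it dies" semantics.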


Many Brains, Many Hands

Decoupling unlocks topological freedom:

  1. One Brain can drive multiple Sandboxes in parallel
  2. A Sandbox can live anywhere that implements execute(name, input) → string: a container, a customer VPC, a phone
  3. Multiple Brains can attach to the same Sandbox over time

That last point hints at multi-agent collaboration — agent A does half the work in a sandbox, then hands the sandbox handle to agent B to continue.
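A hypothetical sketch of that handoff: two agents share one sandbox handle, so B sees the filesystem state A left behind. The tool names and write format are invented:

```python
class SharedSandbox:
    """One sandbox handle that multiple agents can use in sequence."""

    def __init__(self):
        self.files = {}

    def execute(self, name: str, input: str) -> str:
        if name == "write":
            # Invented convention: "path:contents"
            path, _, body = input.partition(":")
            self.files[path] = body
            return "ok"
        if name == "read":
            return self.files.get(input, "")
        return f"unknown tool: {name}"


sandbox = SharedSandbox()

# Agent A does half the work...
assert sandbox.execute("write", "draft.md:half done") == "ok"

# ...then hands the same sandbox handle to agent B, which picks up
# exactly where A left off because the state lives in the Hand.
assert sandbox.execute("read", "draft.md") == "half done"
```

The state transfer is free because it never leaves the sandbox; only the handle moves between Brains.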


Same Problem, Different Cuts: GCP Agent Engine

Anthropic isn’t the only one solving this. Google Cloud’s Vertex AI Agent Engine also offers production-grade Agent hosting, but with a fundamentally different design philosophy.

GCP’s approach is PaaS: we manage Session and Memory for you, you focus on Agent logic.

Anthropic’s approach is OS: we define abstraction layers and interfaces, concrete implementations are swappable.

The key difference is execution isolation. GCP Agent Engine solves state persistence (sessions don’t die with the process), but Agent code and tool execution still share the same Cloud Run service — there’s no Brain-Hands separation. The execute(name, input) → string interface that physically separates reasoning from execution has no counterpart on the GCP side.

From another angle, this reflects the two companies’ different positions: Google sells managed infrastructure, so its PaaS abstracts operations away from the developer; Anthropic sells models, so its OS-style interfaces are designed to survive model iteration.

The two approaches aren’t mutually exclusive — they’re potentially complementary. Running Anthropic’s Brain-Hands topology on GCP’s infrastructure is theoretically viable.


What They Didn’t Say

The post leaves several notable gaps:

  1. No discussion of tool call latency. Every execute() involves a network roundtrip. What’s the impact on high-frequency tool call patterns (e.g., Claude Code’s rapid-fire file reads and edits)? Is there a warm pool or co-location optimization?
  2. Customer cases exist, but scenario differences aren’t explored. Notion, Rakuten, Asana, and Sentry are already running Managed Agents in production (Notion’s use case: Claude picks up tasks from a team board and executes within the workspace). The post also mentions that both Claude Code and task-specific harnesses can run on this architecture. But different scenarios clearly need different Brain-Hands topologies: does a high-frequency tool-calling coding agent use the same setup as a low-frequency async task agent? This isn’t addressed.
  3. Session storage implementation. What backs the append-only event log? What’s the retention policy? What’s the rewind granularity?
  4. Specific context engineering strategies. The post says the Harness handles context engineering but doesn’t elaborate: full injection? Sliding window? Summarization? Only one example is given (the context-anxiety reset).

These gaps are reasonable — this is an architecture philosophy post, not an implementation doc. But if you’re building your own system on these ideas, these blanks need filling.


Conclusion

Managed Agents’ core contribution isn’t “Agents should be layered” — everyone knows that. Its contribution is making explicit where to cut:

  1. Storage (Session) and context engineering (Harness) must be separate. Your context engineering strategy will change with model iterations, but raw data shouldn’t be lost
  2. Reasoning (Brain) and execution (Hands) must be physically isolated. Not for architectural elegance — because security boundaries are only reliable when physically enforced
  3. Interfaces should outlast implementations. execute(name, input) → string is simple enough and general enough to survive whatever comes next

Quoting the original post’s closing:

The challenge we faced is an old one: how to design a system for “programs as yet unthought of.”

Applied to Agent infrastructure: don’t design for what models can do today. Design for the day you don’t yet know what they’ll be capable of.


Original post: Scaling Managed Agents: Decoupling the brain from the hands

