
Building Effective Agents

MasakiMu319

Both predefined workflows and long-running autonomous systems are agentic systems, but their architectures differ: workflows orchestrate LLMs and tools through predefined code paths, while agents let the LLM dynamically direct its own process and tool usage, keeping control over how the task gets accomplished.

When to Use / Not Use Agents

When building applications with LLMs, prefer the simplest solution that can work, and only add complexity when it is necessary. In many cases, this may mean not building an agentic system at all. Agentic systems usually trade latency and cost for better task performance, and you should decide whether that trade-off is justified.

When higher complexity is expected, workflows provide predictability and consistency for well-defined tasks, while agents are better when flexibility and model-driven decision-making are required. For many applications, however, optimizing a single LLM call with retrieval and good in-context examples is sufficient.

When and How to Use Frameworks

Many frameworks make implementing agentic systems easier, including LangGraph from LangChain, Amazon Bedrock's AI Agent framework, Rivet, and Vellum.

These frameworks simplify standard low-level tasks such as calling LLMs, defining and parsing tools, and chaining calls, so it is very easy to get started. But they often add extra abstraction layers that obscure the underlying prompts and responses, making debugging harder. They can also make it tempting to add complexity even when a simpler setup would be enough.

Our recommendation is to start directly with LLM APIs: many patterns only need a few lines of code. If you do use a framework, make sure you understand the underlying code. Wrong assumptions about internals are a common source of customer-facing bugs.

Building Blocks, Workflows, and Agents

Start from foundational building blocks and add complexity gradually.

The foundational block of an agentic system is an augmented LLM, enhanced with capabilities such as retrieval, tools, and memory. Current models can actively use these capabilities: generating their own search queries, selecting appropriate tools, and deciding what information to retain.

We recommend focusing on two key implementation aspects:

  1. Tailor these capabilities to your specific use case.
  2. Ensure they expose a suitable, well-documented interface to your LLM.

There are many ways to implement these augmentations. One example is the recently released Model Context Protocol, which allows developers to integrate with a growing third-party tool ecosystem using a simple client implementation.

In the rest of this article, we assume each LLM request already has access to these augmented capabilities.

Workflow: Prompt Chaining

Prompt chaining decomposes a task into sequential steps, where each LLM call processes the previous output. Programmatic checks can be added at any intermediate step to keep execution on track.

When to use this workflow: prompt chaining is ideal when a task can be cleanly broken into fixed subtasks. The goal is to improve accuracy by simplifying each LLM call, at the cost of extra latency.
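
The pattern can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `call_llm` is a deterministic stand-in for a real LLM API call, and `gate` is an example of the programmatic check between steps.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; replace with your provider's client."""
    # Toy behavior so the sketch runs end to end.
    if prompt.startswith("Outline:"):
        return "1. intro 2. body 3. conclusion"
    return f"Draft based on: {prompt}"

def gate(outline: str) -> bool:
    """Programmatic check between steps: reject outlines with fewer than two items."""
    return outline.count(".") >= 2

def chained_write(topic: str) -> str:
    # Step 1: produce an outline; check it; Step 2: write from the outline.
    outline = call_llm(f"Outline: {topic}")
    if not gate(outline):  # keep execution on track before spending the next call
        raise ValueError("outline failed the intermediate check")
    return call_llm(f"Write the document following this outline: {outline}")

print(chained_write("agent design"))
```

Each call sees a smaller, simpler task than the original prompt, which is where the accuracy gain comes from.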

Useful examples for prompt chaining:

  1. Generating marketing copy, then translating it into a different language.
  2. Writing an outline for a document, checking that the outline meets certain criteria, then writing the document based on the outline.

Workflow: Routing

Routing identifies the input type and directs it to a corresponding downstream task. This enables separation of responsibilities and more specialized prompts. Without routing, optimizing for one input class can hurt performance on others.

When to use this workflow: routing works well for complex tasks where distinct categories are better handled separately, and category assignment can be done accurately via an LLM or a traditional classifier.
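
A minimal sketch of the pattern, assuming a keyword classifier in place of the LLM or trained classifier the text describes (the categories and handlers here are made up for illustration):

```python
def classify(query: str) -> str:
    """Stand-in router; in practice this is an LLM call or a trained classifier."""
    if "refund" in query.lower():
        return "refund"
    if "password" in query.lower():
        return "tech_support"
    return "general"

# Each category gets its own specialized prompt/handler, so optimizing
# one does not degrade the others.
HANDLERS = {
    "refund": lambda q: f"[refund flow] {q}",
    "tech_support": lambda q: f"[tech-support prompt] {q}",
    "general": lambda q: f"[general prompt] {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)

print(route("I want a refund for my order"))
```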

Useful examples for routing:

  1. Directing different types of customer-service queries (general questions, refund requests, technical support) into distinct downstream processes, prompts, and tools.
  2. Routing easy or common questions to smaller, cheaper models and hard or unusual questions to more capable models, to optimize cost and speed.

Workflow: Parallelization

LLMs can sometimes process parts of a task simultaneously and combine outputs programmatically. This workflow has two main variants:

  1. Sectioning: split a task into independent subtasks and run them in parallel.
  2. Voting: run the same task multiple times to obtain diverse outputs, then aggregate them (for example by majority vote).

When to use this workflow: parallelization is effective when subtasks can run independently for speed, or when multiple perspectives/attempts improve confidence. For complex tasks with multiple concerns, separate LLM calls per concern often improve quality through focused attention.
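
Both variants can be sketched with a thread pool; `call_llm` below is a deterministic stub standing in for a real model call, so the example is runnable as-is:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (I/O-bound, so threads parallelize it)."""
    return "unsafe" if "attack" in prompt else "safe"

def section(prompts: list[str]) -> list[str]:
    """Sectioning: independent subtasks run in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(call_llm, prompts))

def vote(prompt: str, n: int = 3) -> str:
    """Voting: run the same task n times and take the majority answer."""
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(call_llm, [prompt] * n))
    return Counter(answers).most_common(1)[0][0]

print(section(["check style", "check security"]))
print(vote("review this code for an attack vector"))
```

Threads suffice here because LLM calls are network-bound; the programmatic aggregation step (list collection, majority vote) is what distinguishes this from routing.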

Useful examples for parallelization:

  1. Sectioning: implementing guardrails, where one model instance processes the user query while another screens it for inappropriate content; automating evals, where each LLM call assesses a different aspect of model performance.
  2. Voting: reviewing a piece of code for vulnerabilities with several different prompts; evaluating whether content is inappropriate, with multiple prompts or vote thresholds to balance false positives and false negatives.

Workflow: Orchestrator-workers

In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and combines the results.

When to use this workflow: ideal when required subtasks are not predictable in advance (for example coding tasks, where the number of files and edits depends on the issue). Although visually similar to parallelization, the key difference is flexibility: subtasks are decided by the orchestrator at runtime, not predefined.
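
A minimal sketch of the shape, with stubbed orchestrator and worker calls (the function names and toy outputs are illustrative, not a real API):

```python
def plan(task: str) -> list[str]:
    """Orchestrator LLM call (stubbed): decide the subtasks at runtime."""
    return [f"edit file {i} for: {task}" for i in range(2)]

def worker(subtask: str) -> str:
    """Worker LLM call (stubbed): handle one delegated subtask."""
    return f"done: {subtask}"

def synthesize(results: list[str]) -> str:
    """Combine worker results into a final answer."""
    return "\n".join(results)

def orchestrate(task: str) -> str:
    subtasks = plan(task)  # not predefined: chosen per task by the orchestrator
    return synthesize([worker(s) for s in subtasks])

print(orchestrate("fix the login bug"))
```

The contrast with parallelization is that `plan` runs at request time; in parallelization the list of subtasks is fixed in code.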

Useful examples for orchestrator-workers:

  1. Coding products that make complex changes across multiple files on each run.
  2. Search tasks that gather and analyze information from multiple sources for possibly relevant material.

Workflow: Evaluator-optimizer

In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in an iterative loop.

When to use this workflow: especially effective when evaluation criteria are clear and iterative refinement creates measurable value. Two strong signals are: (1) the output improves clearly when given human-style feedback; (2) an LLM can generate that feedback well. This mirrors human iterative drafting for polished writing.
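
The loop can be sketched as follows; `generate` and `evaluate` are deterministic stand-ins for the two LLM roles, and the stop conditions (evaluator acceptance or a round limit) are the part worth copying:

```python
def generate(prompt: str, feedback: str = "") -> str:
    """Generator LLM call (stubbed): produces a draft, improves with feedback."""
    return prompt + (" [revised]" if feedback else " [first draft]")

def evaluate(draft: str) -> tuple[bool, str]:
    """Evaluator LLM call (stubbed): returns (accepted, feedback)."""
    if "[revised]" in draft:
        return True, ""
    return False, "tighten the wording"

def refine(prompt: str, max_rounds: int = 3) -> str:
    draft = generate(prompt)
    for _ in range(max_rounds):  # bounded loop: don't iterate forever
        accepted, feedback = evaluate(draft)
        if accepted:
            break
        draft = generate(prompt, feedback)  # regenerate with the critique
    return draft

print(refine("translate this poem"))
```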

Useful examples for evaluator-optimizer:

  1. Literary translation, where nuances the translator LLM misses at first can be caught and improved through an evaluator's critiques.
  2. Complex search tasks requiring multiple rounds of searching and analysis, where the evaluator decides whether further searches are warranted.

Agents

As LLM capabilities mature in key dimensions (understanding complex input, reasoning/planning, reliable tool use, and recovery from errors), agents are increasingly viable in production.

Agent work usually starts from a user command or interactive discussion. Once the task is clear, the agent plans and executes autonomously, and may ask humans for additional information or judgment when needed. During execution, agents must repeatedly get “ground truth” from the environment (tool outputs, code execution results, etc.) to evaluate progress. Agents can pause at checkpoints or blockers for human feedback. Tasks usually terminate on completion, but should also include stop conditions (for example max iterations) for control.

Agents can handle complex tasks, but implementations are often simple. In many cases, an agent is just an LLM using tools in a loop based on environmental feedback. This is why clear toolset design and documentation are critical. Appendix 2 expands on tool prompt-engineering practices.
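
That "LLM using tools in a loop" shape can be made concrete in a short sketch. Everything here is a stand-in (`call_llm` is a stubbed model decision, the calculator is a toy tool), but the structure mirrors the description above: act, observe ground truth from the environment, and stop either on completion or at an iteration cap.

```python
def call_llm(state: list[str]) -> dict:
    """Stand-in for the model: given the transcript, pick a tool call or finish."""
    if any("42" in entry for entry in state):
        return {"action": "finish", "answer": "42"}
    return {"action": "tool", "name": "calculator", "args": "6*7"}

# Toy tool registry; eval() is for illustration only, never for untrusted input.
TOOLS = {"calculator": lambda expr: str(eval(expr))}

def run_agent(task: str, max_steps: int = 5) -> str:
    state = [task]
    for _ in range(max_steps):  # stop condition: bounded iterations
        decision = call_llm(state)
        if decision["action"] == "finish":
            return decision["answer"]
        # "Ground truth" from the environment: the tool's actual output,
        # appended to the transcript so the next decision can use it.
        result = TOOLS[decision["name"]](decision["args"])
        state.append(f"{decision['name']} -> {result}")
    return "stopped: max steps reached"

print(run_agent("what is 6*7?"))
```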

When to use agents: agents are suited to open-ended problems where required steps cannot be predicted and fixed paths cannot be hardcoded. LLMs may operate over multiple turns, and you must trust model decision-making to some extent. Agent autonomy makes them ideal for scaling work in trusted environments.

That same autonomy also increases cost and error accumulation risk. We recommend extensive sandbox testing and appropriate guardrails.

Useful examples for agents

Examples from our own implementations:

  1. A coding agent for resolving SWE-bench tasks, which involve edits to many files based on a task description.
  2. Our "computer use" reference implementation, where Claude uses a computer to accomplish tasks.

Combining and Customizing These Patterns

These building blocks are not rigid templates. They are common patterns you can shape and combine per use case. As with any LLM feature, success depends on measuring outcomes and iterating continuously. Again: only add complexity when it materially improves results.

Summary

Success in LLM applications is not about building the most complex system, but the system that fits your needs. Start with simple prompting, optimize through thorough evaluation, and only add multi-step autonomous systems when simpler approaches fail.

When implementing agents, we try to follow three core principles:

  1. Keep agent design simple.
  2. Prioritize transparency and clearly expose planning steps.
  3. Build the agent-computer interface (ACI) carefully through strong tool documentation and testing.

Frameworks can help you start quickly, but when moving to production, do not hesitate to reduce abstraction and build with fundamental components. Following these principles helps create agents that are powerful, reliable, maintainable, and trusted by users.

Appendix 1: Agents in Practice

Our work with customers has revealed two especially promising application areas that demonstrate the practical value of the patterns above. Both show that agents create the most value in tasks that require both dialogue and action, have clear success criteria, support feedback loops, and allow effective human oversight.

A. Customer Support

Customer support combines a familiar chatbot interface with capabilities enhanced by tool integration. This is a strong fit for open-ended agents because:

  1. Support interactions naturally follow a conversation flow while also requiring access to external information and actions.
  2. Tools can be integrated to pull customer data, order history, and knowledge-base articles.
  3. Actions such as issuing refunds or updating tickets can be handled programmatically.
  4. Success can be clearly measured through user-defined resolutions.

Several companies have demonstrated viability through usage-based pricing that charges only for successful resolutions, reflecting confidence in agent effectiveness.

B. Coding Agents

Software engineering shows significant potential for LLM agents, evolving from code completion to autonomous problem solving. Agents are effective here because:

  1. Code solutions are verifiable through automated tests.
  2. Agents can iterate on solutions using test results as feedback.
  3. The problem space is well-defined and structured.
  4. Output quality can be measured objectively.

In our own implementation, agents can now resolve real GitHub issues directly from pull-request descriptions on SWE-bench Verified. Still, while automated tests validate functionality, human review remains essential to ensure broader system alignment.

Appendix 2: Prompt Engineering for Tools

No matter what type of agent system you are building, tools are likely core components. Tools allow Claude to interact with external services and APIs through explicit structure and definitions in our API. When Claude intends to call a tool, the API response includes a tool-use block.

Tool definitions deserve the same level of prompt-engineering attention as your main system prompt. This appendix summarizes practical guidance.

There are often multiple ways to represent the same operation. For example, file editing can be represented as diffs or full-file rewrites. Structured output can be returned as markdown or JSON. In software engineering these formats are often equivalent, but some are harder for LLMs to produce reliably. Writing diffs requires precise line accounting. Writing code inside JSON requires extra escaping for quotes and newlines.
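
The escaping overhead is easy to see concretely. The snippet below, a small illustration rather than a recommendation of either library, compares the same two lines of code embedded in JSON versus a markdown fence:

```python
import json

code = 'print("hello")\nprint("world")\n'

# Inside JSON, every quote and newline must be escaped, and the model
# has to produce each escape sequence correctly, token by token.
as_json = json.dumps({"code": code})

# Inside a markdown fence, the code appears exactly as it does in
# naturally occurring text on the internet.
as_markdown = f"```python\n{code}```"

print(as_json)
print(as_markdown)
```

The JSON form forces the model to write `\"` and `\n` for characters it would otherwise emit directly, which is exactly the kind of formatting overhead that degrades reliability.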

Our recommendations for choosing tool formats:

  1. Give the model enough tokens to "think" before it writes itself into a corner.
  2. Keep formats close to what the model has seen in naturally occurring text on the internet.
  3. Avoid formatting "overhead", such as keeping accurate line counts or escaping every quote and newline.

A practical rule: invest as much effort in ACI (agent-computer interface) as teams normally invest in HCI (human-computer interface).

Ideas to improve ACI quality:

  1. Put yourself in the model's place: is it obvious how to use the tool from its description and parameters, or would you need to think carefully? A good tool definition often includes example usage, edge cases, input-format requirements, and clear boundaries with other tools.
  2. Rename parameters and rewrite descriptions until usage is obvious: think of it as writing a great docstring for a junior developer on your team.
  3. Test how the model actually uses your tools: run many example inputs, observe what mistakes it makes, and iterate.
  4. Poka-yoke your tools: change their arguments so that mistakes are harder to make.

When building our agent for SWE-bench, we actually spent more time optimizing tools than the overall prompt. For example, we found that relative file paths caused failures once the agent moved away from root directories. We fixed this by requiring absolute file paths in tools, and observed much more reliable model behavior.
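
That fix is a poka-yoke: rejecting relative paths at the tool boundary instead of hoping the model keeps track of the working directory. A minimal sketch of such a guard (the tool name and wrapper are hypothetical, not from our actual implementation):

```python
import os
import tempfile

def read_file(path: str) -> str:
    """Tool wrapper: require absolute paths, so behavior never depends
    on the agent's current working directory (a poka-yoke guard)."""
    if not os.path.isabs(path):
        raise ValueError(f"expected an absolute path, got {path!r}")
    with open(path) as f:
        return f.read()

# Demo: absolute paths work; relative paths fail fast with a clear error
# the model can act on, instead of silently reading the wrong file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
print(read_file(tmp.name))
try:
    read_file("notes.txt")
except ValueError as err:
    print("rejected:", err)
```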
