Practical survey of single- vs. multi-agent designs, planning steps, and tool calling trade-offs

Overview

Decision Snapshot: Needs Validation

The survey summarizes many prototypical systems and empirical indications of what works, but its findings draw on papers that use different benchmarks, some with contamination concerns. Apply the designs conservatively and validate them with your own tests.

Citations: 19

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Tula Masterman, Sandi Besen, Mason Sawtell, Alex Chao

Links

Abstract / PDF

Why It Matters For Business

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This short survey maps current AI agent designs that combine large language models (LLMs) with planning and tool calls. It compares single-agent and multi-agent patterns, catalogs design choices (leadership, memory, message filtering, dynamic teams), and summarizes strengths and failure modes. Key practical points: single agents are simpler and work well for narrowly scoped tool-driven tasks; multi-agent teams help parallelize, provide diverse feedback, and often benefit from a designated leader or dynamic team management. Evaluation gaps and benchmark contamination remain major limits.

Problem Statement

Practitioners need a clear, practical view of modern LLM-powered agent architectures: when to pick a single-agent vs. a multi-agent design, which design elements matter for robust planning and tool use, and what current research says about evaluation gaps and failure modes.

Main Contribution

A compact taxonomy and comparison of single-agent vs. multi-agent architectures and their variants (vertical/horizontal).

A focused checklist of design levers that improve agent performance: leadership, planning phases, role definition, message filtering, dynamic teams, and human feedback.

Key Findings

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

Practical Use: Use an interleaved reason-act loop (ReAct) to lower hallucination when solving multi-step QA tasks; monitor for repetitive loops.

Evidence Ref: ReAct / HotpotQA [29,32]
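The interleaved reason-act loop and the loop monitoring mentioned above can be sketched roughly as follows. This is a minimal illustration, not the ReAct authors' implementation: `policy` is a hypothetical stand-in for an LLM call, `tools` is a hypothetical tool registry, and the repeated-action check and step budget are one simple way to guard against non-terminating loops.

```python
from typing import Callable, Dict, List, Tuple

# (thought, action_name, action_input); the action name "finish" ends the loop.
Step = Tuple[str, str, str]


def react_loop(
    question: str,
    policy: Callable[[str, List[str]], Step],  # hypothetical stand-in for an LLM call
    tools: Dict[str, Callable[[str], str]],    # hypothetical tool registry
    max_steps: int = 8,
) -> str:
    transcript: List[str] = []
    seen_actions = set()
    for _ in range(max_steps):
        thought, action, arg = policy(question, transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        # Loop guard: stop if the agent repeats an identical action.
        if (action, arg) in seen_actions:
            return "ABORTED: repeated action"
        seen_actions.add((action, arg))
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool {action!r}"
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return "ABORTED: step budget exhausted"


def scripted_policy(question: str, transcript: List[str]) -> Step:
    # Toy policy: look something up once, then answer with the observation.
    prefix = "Observation: "
    observations = [t for t in transcript if t.startswith(prefix)]
    if not observations:
        return ("I should look this up.", "lookup", question)
    return ("I have what I need.", "finish", observations[-1][len(prefix):])


def stuck_policy(question: str, transcript: List[str]) -> Step:
    # Toy policy that never makes progress, to exercise the loop guard.
    return ("Try again.", "lookup", question)


toy_tools = {"lookup": lambda q: "Paris" if "France" in q else "unknown"}
answer = react_loop("capital of France?", scripted_policy, toy_tools)
stuck = react_loop("capital of France?", stuck_policy, toy_tools)
```

In a real system the policy would be a model call that reads the transcript, and the guard would likely be softer (for example, allowing a bounded number of retries) than the hard abort used here.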

Designating a team leader speeds multi-agent task completion.

Numbers: ≈10% faster time-to-completion with a leader

Practical Use: Add a clear leader (human or model) in team-based agents to reduce time and coordination overhead.

Evidence Ref: Embodied LLM Teams [9]

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
hallucination rate	ReAct: 6%; CoT: 14%	Chain-of-Thought	−8 percentage points vs CoT	HotpotQA	ReAct evaluation on HotpotQA	[29,32]
time-to-completion	≈10% faster with a team leader	leaderless teams	≈−10% time	Embodied LLM team experiments	Leader improves coordination and reduces wasted chat	[9]
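The two rows report different kinds of delta: the hallucination row is an absolute difference in percentage points, while the time-to-completion row is a relative change. As a small worked check, the sketch below derives both forms for the HotpotQA row from the 14% and 6% figures above; the helper name `deltas` is illustrative only.

```python
from typing import Tuple


def deltas(baseline_pct: float, treatment_pct: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change as a fraction)."""
    abs_pp = treatment_pct - baseline_pct
    rel = (treatment_pct - baseline_pct) / baseline_pct
    return abs_pp, rel


# HotpotQA hallucination rates from the table: CoT 14%, ReAct 6%.
abs_pp, rel = deltas(14.0, 6.0)
print(f"{abs_pp:+.0f} pp, {rel:+.1%} relative")  # -8 pp, -57.1% relative
```

Keeping the two forms distinct matters when comparing rows: an 8-point absolute drop and a 10% relative speedup are not on the same scale.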

What To Try In 7 Days

Prototype a single-agent flow with a tight persona and a short scratchpad memory.

Run a small multi-agent demo with a designated leader and one specialist agent to test parallelism.

Add a simple message filter so agents only receive task-relevant messages and measure time-to-completion.
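The leader and message-filter experiments above could start from something like the following sketch. It assumes a topic-based filter in which each agent subscribes to the topics relevant to its role; the `Message`, `Agent`, and `MessageBus` names are hypothetical, and the count of delivered messages is used only as a rough proxy for communication cost, not as a measurement of time-to-completion.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Message:
    topic: str
    body: str


@dataclass
class Agent:
    name: str
    topics: Set[str]  # topics this agent subscribes to
    inbox: List[Message] = field(default_factory=list)


class MessageBus:
    """Delivers each message only to subscribed agents when filtering is on."""

    def __init__(self, agents: List[Agent], filtered: bool = True):
        self.agents = agents
        self.filtered = filtered
        self.delivered = 0  # rough proxy for communication cost

    def publish(self, msg: Message) -> None:
        for agent in self.agents:
            if not self.filtered or msg.topic in agent.topics:
                agent.inbox.append(msg)
                self.delivered += 1


def run_demo(filtered: bool) -> int:
    leader = Agent("leader", {"status"})
    coder = Agent("coder", {"code"})
    tester = Agent("tester", {"test"})
    bus = MessageBus([leader, coder, tester], filtered=filtered)
    # The leader assigns work; specialists report back on the status topic.
    bus.publish(Message("code", "implement parser"))
    bus.publish(Message("test", "write parser tests"))
    bus.publish(Message("status", "parser done"))
    bus.publish(Message("status", "tests pass"))
    return bus.delivered


filtered_count = run_demo(filtered=True)    # 4 deliveries
broadcast_count = run_demo(filtered=False)  # 12 deliveries
```

To measure time-to-completion as suggested, the same harness would need real agent calls and wall-clock timing around a full task run, with and without the filter.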

Agent Features

Memory

scratchpad_short_term, long_term_dataset_memory, sliding_window_context

Planning

task_decomposition, multi-plan_selection, external_module_planning, reflection_refinement, memory_augmented_planning

Tool Use

function_calling, api_integration, robotic_task_planning, tool_selection

Frameworks

ReAct, RAISE, Reflexion, AutoGPT+P, LATS, AgentVerse, DyLAN, MetaGPT

Is Agentic

Yes

Architectures

single-agent, multi-agent, vertical (leader-based), horizontal (peer-based), dynamic teams

Collaboration

leader-based, peer-to-peer, publish-subscribe

Optimization Features

Token Efficiency

sliding_window_memory
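One way a sliding-window memory can bound token use is to evict the oldest turns once a budget is exceeded. The sketch below assumes a crude whitespace token count and a hypothetical `SlidingWindowMemory` class; a real agent would use its model's tokenizer and might summarize evicted turns instead of dropping them.

```python
from collections import deque
from typing import Deque


class SlidingWindowMemory:
    """Keeps the most recent turns that fit within a token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns: Deque[str] = deque()
        self.used = 0

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: whitespace-separated words approximate tokens.
        return len(text.split())

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.used += self.count_tokens(turn)
        # Evict the oldest turns until within budget; always keep the newest.
        while self.used > self.max_tokens and len(self.turns) > 1:
            oldest = self.turns.popleft()
            self.used -= self.count_tokens(oldest)

    def context(self) -> str:
        return "\n".join(self.turns)


memory = SlidingWindowMemory(max_tokens=8)
for turn in [
    "user: find flights",
    "tool: 3 flights found",
    "agent: cheapest is LH123",
]:
    memory.add(turn)
```

After the third turn the first one is evicted, so `memory.context()` holds only the tool result and the agent's reply.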

System Optimization

dynamic_agent_recruitment

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Heterogeneous and often proprietary benchmarks make cross-paper comparison hard.

Training data contamination can inflate reported benchmark performance.

When Not To Use

Avoid multi-agent teams for narrowly defined, single-tool workflows where overhead outweighs benefit.

Avoid agentic autonomy without human oversight on high-stakes or safety-critical tasks.

Failure Modes

Agents get stuck in repetitive reasoning-action loops and never terminate.

Role hallucination: agents take actions or claim capabilities outside their intended role.
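A simple guard against role hallucination is to check every tool call against a per-role allowlist before executing it. The sketch below is an illustration under assumptions: the role names, tool names, `ROLE_TOOLS` table, and `call_tool` helper are all hypothetical and are not taken from the survey.

```python
from typing import Callable, Dict, Set


class RoleViolation(Exception):
    """Raised when an agent requests a tool outside its role."""


# Hypothetical role-to-tool allowlist.
ROLE_TOOLS: Dict[str, Set[str]] = {
    "researcher": {"search"},
    "coder": {"run_tests", "write_file"},
}


def call_tool(
    role: str,
    tool_name: str,
    tools: Dict[str, Callable[[str], str]],
    arg: str,
) -> str:
    allowed = ROLE_TOOLS.get(role, set())
    if tool_name not in allowed:
        raise RoleViolation(f"{role!r} may not call {tool_name!r}")
    return tools[tool_name](arg)


toy_tools = {
    "search": lambda q: f"results for {q}",
    "write_file": lambda path: f"wrote {path}",
    "run_tests": lambda suite: "ok",
}

ok = call_tool("researcher", "search", toy_tools, "agent surveys")
try:
    call_tool("researcher", "write_file", toy_tools, "notes.txt")
    blocked = False
except RoleViolation:
    blocked = True
```

The check lives outside the model, so it holds even when a role prompt is ignored; rejected calls can be logged and fed back to the agent as an observation.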

Core Entities

Models

GPT-4, GPT-3.5-turbo, GPT-4+

Metrics

time-to-completion, success rate, communication cost, hallucination rate, output similarity to human responses, efficiency of tool use

Datasets

HotpotQA, HumanEval, MBPP, WildChat/WildBench (570k), AgentBench, SmartPlay, SWE-bench, MMLU, GSM8K, StrategyQA

Benchmarks

AgentBench, SmartPlay, WildBench, HumanEval, MBPP, SWE-bench
