Practical survey of single- vs. multi-agent designs, planning steps, and tool calling trade-offs

April 17, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The survey summarizes many prototype systems and empirical signals of what works, but its findings draw on papers that use different benchmarks, some with training-data contamination concerns. Apply these designs conservatively and validate them with your own tests.

Citations: 19

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Tula Masterman, Sandi Besen, Mason Sawtell, Alex Chao

Why It Matters For Business

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Summary TLDR

This short survey maps current AI agent designs that combine large language models (LLMs) with planning and tool calls. It compares single-agent and multi-agent patterns, catalogs design choices (leadership, memory, message filtering, dynamic teams), and summarizes strengths and failure modes. Key practical points: single agents are simpler and work well for narrowly scoped tool-driven tasks; multi-agent teams help parallelize, provide diverse feedback, and often benefit from a designated leader or dynamic team management. Evaluation gaps and benchmark contamination remain major limits.

Problem Statement

Practitioners need a clear, practical view of modern LLM-powered agent architectures: when to pick single- vs. multi-agent designs, which design elements matter for robust planning and tool use, and what current research says about evaluation gaps and failure modes.

Main Contribution

A compact taxonomy and comparison of single-agent vs. multi-agent architectures and their variants (vertical/horizontal).

A focused checklist of design levers that improve agent performance: leadership, planning phases, role definition, message filtering, dynamic teams, and human feedback.

Key Findings

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

Practical Use: Use an interleaved reason-act loop (ReAct) to lower hallucination when solving multi-step QA tasks; monitor for repetitive loops.

Evidence Ref: ReAct / HotpotQA [29,32]
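
The interleaved reason-act loop with loop monitoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the ReAct authors' implementation: `scripted_policy`, `stuck_policy`, and `lookup` are hypothetical stand-ins for an LLM call and a real tool, and the `max_steps` and `max_repeats` limits are arbitrary example values.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# One step of the loop: (thought, tool_name, tool_input).
# A tool_name of "finish" ends the episode and tool_input is the answer.
Step = Tuple[str, str, str]


def run_react(
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
    max_repeats: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate reasoning and tool calls until the policy finishes or a guard trips."""
    trace: List[str] = []
    seen: Counter = Counter()
    for _ in range(max_steps):
        thought, tool_name, tool_input = policy(trace)
        trace.append(f"Thought: {thought}")
        if tool_name == "finish":
            return tool_input, trace
        # Guard against repetitive loops: the same action issued too many times.
        seen[(tool_name, tool_input)] += 1
        if seen[(tool_name, tool_input)] > max_repeats:
            return "ABORTED: repeated action", trace
        observation = tools[tool_name](tool_input)
        trace.append(f"Action: {tool_name}[{tool_input}]")
        trace.append(f"Observation: {observation}")
    return "ABORTED: step budget exhausted", trace


# Hypothetical stand-in for a search tool.
def lookup(query: str) -> str:
    facts = {"capital of France": "Paris"}
    return facts.get(query, "no result")


# Hypothetical stand-in for an LLM policy that searches once, then answers.
def scripted_policy(trace: List[str]) -> Step:
    if not any(line.startswith("Observation:") for line in trace):
        return ("I should look this up.", "lookup", "capital of France")
    answer = trace[-1].split(": ", 1)[1]
    return ("The observation answers the question.", "finish", answer)


# Hypothetical policy that never makes progress, to exercise the loop guard.
def stuck_policy(trace: List[str]) -> Step:
    return ("Try the same search again.", "lookup", "unknown thing")


answer, trace = run_react(scripted_policy, {"lookup": lookup})
stuck_answer, _ = run_react(stuck_policy, {"lookup": lookup})
```

The repeat counter and the step budget are two separate guards: the first catches an agent that reissues the same action, the second bounds runs that vary their actions but still never finish.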

Designating a team leader speeds multi-agent task completion.

Numbers: ≈10% faster time-to-completion with a leader

Practical Use: Add a clear leader (human or model) in team-based agents to reduce time and coordination overhead.

Evidence Ref: Embodied LLM Teams [9]
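
A leader-based (vertical) team can be sketched as a leader that owns the plan and routes each subtask to a named specialist. This is an illustrative sketch, not the experimental setup from [9]: the `Leader` class, the `plan` list, and the `researcher` and `writer` functions are hypothetical, and in practice the leader and specialists would be LLM calls or a human.

```python
from typing import Callable, Dict, List, Tuple

Specialist = Callable[[str], str]


class Leader:
    """Hypothetical leader: owns the plan, assigns subtasks, collects results."""

    def __init__(self, specialists: Dict[str, Specialist]) -> None:
        self.specialists = specialists
        self.log: List[str] = []  # one entry per leader-to-agent message

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        results = []
        for role, subtask in plan:
            if role not in self.specialists:
                raise KeyError(f"no specialist for role {role!r}")
            # Only the assigned specialist receives the subtask.
            self.log.append(f"leader -> {role}: {subtask}")
            results.append(self.specialists[role](subtask))
        return results


# Hypothetical specialists standing in for role-prompted agents.
def researcher(subtask: str) -> str:
    return f"notes on {subtask}"


def writer(subtask: str) -> str:
    return f"draft of {subtask}"


leader = Leader({"researcher": researcher, "writer": writer})
outputs = leader.run([("researcher", "agent memory"), ("writer", "summary")])
```

Because every message passes through the leader, `leader.log` gives a simple count of coordination messages to compare against a leaderless variant.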

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
hallucination rate | ReAct: 6%; CoT: 14% | Chain-of-Thought | −8 percentage points vs CoT | HotpotQA | ReAct evaluation on HotpotQA | [29,32]
time-to-completion | ≈10% faster with a team leader | leaderless teams | ≈−10% time | Embodied LLM team experiments | Leader improves coordination and reduces wasted chat | [9]
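
The hallucination-rate delta is an absolute difference in percentage points, which is not the same as a relative reduction. The short sketch below recomputes both from the two reported rates. The function names are illustrative, and the relative figure is derived arithmetic rather than a number reported in the survey.

```python
def percentage_point_delta(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return value_pct - baseline_pct


def relative_change(value_pct: float, baseline_pct: float) -> float:
    """Change relative to the baseline, as a fraction of the baseline."""
    return (value_pct - baseline_pct) / baseline_pct


react_pct, cot_pct = 6.0, 14.0  # hallucination rates from the results table
pp_delta = percentage_point_delta(react_pct, cot_pct)
rel = relative_change(react_pct, cot_pct)
print(f"{pp_delta:+.0f} pp, {rel:+.1%} relative")  # prints "-8 pp, -57.1% relative"
```

Reporting both figures avoids confusion when comparing results across papers that state deltas differently.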

What To Try In 7 Days

Prototype a single-agent flow with a tight persona and a short scratchpad memory.

Run a small multi-agent demo with a designated leader and one specialist agent to test parallelism.

Add a simple message filter so agents only receive task-relevant messages and measure time-to-completion.
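
A simple message filter with a timing harness might look like the sketch below. It assumes topic-based subscriptions. The `Message` and `Agent` classes, the topic names, and `timed_deliver` are hypothetical, and the wall-clock measurement is only a placeholder for a real time-to-completion metric.

```python
import time
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Message:
    topic: str
    body: str


@dataclass
class Agent:
    name: str
    topics: Set[str]  # topics this agent subscribes to
    inbox: List[Message] = field(default_factory=list)


def deliver(messages: List[Message], agents: List[Agent], filtered: bool) -> int:
    """Route messages to agents and return how many deliveries were made."""
    deliveries = 0
    for msg in messages:
        for agent in agents:
            if filtered and msg.topic not in agent.topics:
                continue  # skip messages outside the agent's subscribed topics
            agent.inbox.append(msg)
            deliveries += 1
    return deliveries


def make_agents() -> List[Agent]:
    return [Agent("coder", {"code"}), Agent("tester", {"tests"})]


messages = [
    Message("code", "implement parser"),
    Message("tests", "add edge cases"),
    Message("code", "fix bug"),
]


def timed_deliver(filtered: bool) -> Tuple[int, float]:
    """Return (deliveries, elapsed seconds) for one routing pass."""
    start = time.perf_counter()
    count = deliver(messages, make_agents(), filtered)
    return count, time.perf_counter() - start


unfiltered_count, unfiltered_secs = timed_deliver(filtered=False)
filtered_count, filtered_secs = timed_deliver(filtered=True)
```

In a real system, each delivery costs context tokens and model time, so the delivery count is a useful proxy to track alongside elapsed time.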

Agent Features

Memory
scratchpad_short_term, long_term_dataset_memory, sliding_window_context
Planning
task_decomposition, multi-plan_selection, external_module_planning, reflection_refinement, memory_augmented_planning
Tool Use
function_calling, api_integration, robotic_task_planning, tool_selection
Frameworks
ReAct, RAISE, Reflexion, AutoGPT+P, LATS, AgentVerse, DyLAN, MetaGPT
Is Agentic

Yes

Architectures
single-agent, multi-agent, vertical (leader-based), horizontal (peer-based), dynamic teams
Collaboration
leader-based, peer-to-peer, publish-subscribe

Optimization Features

Token Efficiency
sliding_window_memory
System Optimization
dynamic_agent_recruitment
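
Sliding-window memory keeps token use bounded by evicting the oldest turns. The sketch below is one possible shape, under stated assumptions: the `SlidingWindowMemory` class is hypothetical, and whitespace-separated words stand in for tokens, whereas a real system would use the model's tokenizer.

```python
from collections import deque
from typing import Deque, List


class SlidingWindowMemory:
    """Keep only the most recent turns that fit within a rough token budget."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.turns: Deque[str] = deque()

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: whitespace-separated words approximate tokens.
        return len(text.split())

    def total_tokens(self) -> int:
        return sum(self.count_tokens(t) for t in self.turns)

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict the oldest turns until the window fits, always keeping the newest.
        while len(self.turns) > 1 and self.total_tokens() > self.max_tokens:
            self.turns.popleft()

    def context(self) -> List[str]:
        return list(self.turns)


memory = SlidingWindowMemory(max_tokens=6)
memory.add("user asks about agents")  # 4 words
memory.add("agent replies briefly")   # 3 words, total 7, so the oldest turn is evicted
memory.add("user says thanks")        # 3 words, total 6, which fits
```

After these three calls, `memory.context()` holds only the last two turns.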

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Heterogeneous and often proprietary benchmarks make cross-paper comparison hard.

Training data contamination can inflate reported benchmark performance.

When Not To Use

Avoid multi-agent teams for narrowly defined, single-tool workflows where overhead outweighs benefit.

Avoid agentic autonomy without human oversight on high-stakes or safety-critical tasks.

Failure Modes

Agents get stuck in repetitive reasoning-action loops and never terminate.

Role hallucination: agents perform capabilities outside their intended role.
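
One way to contain role hallucination is to check every tool request against an allow-list for the agent's role before executing it. The sketch below illustrates the idea. The role names, tool names, and the `ROLE_TOOLS` mapping are hypothetical, and this guard is a suggested mitigation rather than a technique prescribed by the survey.

```python
from typing import Dict, FrozenSet

# Hypothetical role definitions: each role may call only the listed tools.
ROLE_TOOLS: Dict[str, FrozenSet[str]] = {
    "researcher": frozenset({"search", "read_file"}),
    "coder": frozenset({"read_file", "run_tests"}),
}


class RoleViolation(Exception):
    """Raised when an agent requests a tool outside its role."""


def authorize(role: str, tool: str) -> None:
    """Raise RoleViolation unless the role is allowed to call the tool."""
    allowed = ROLE_TOOLS.get(role, frozenset())  # unknown roles get no tools
    if tool not in allowed:
        raise RoleViolation(f"{role!r} may not call {tool!r}")


def guarded_call(role: str, tool: str) -> str:
    """Return a status string instead of executing an unauthorized tool."""
    try:
        authorize(role, tool)
    except RoleViolation as exc:
        return f"blocked: {exc}"
    return f"ok: {role} -> {tool}"


allowed_result = guarded_call("coder", "run_tests")
blocked_result = guarded_call("coder", "search")
```

Logging blocked requests also gives a rough measure of how often agents drift outside their intended role.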

Core Entities

Models

GPT-4, GPT-3.5-turbo, GPT-4+

Metrics

time-to-completion, success rate, communication cost, hallucination rate, output similarity to human responses, efficiency of tool use

Datasets

HotpotQA, HumanEval, MBPP, WildChat/WildBench (570k), AgentBench, SmartPlay, SWE-bench, MMLU, GSM8K, StrategyQA

Benchmarks

AgentBench, SmartPlay, WildBench, HumanEval, MBPP, SWE-bench