Practical survey of single- vs. multi-agent designs, planning steps, and tool calling trade-offs

April 17, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

19

Authors

Tula Masterman, Sandi Besen, Mason Sawtell, Alex Chao

Why It Matters For Business

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Summary TLDR

This short survey maps current AI agent designs that combine large language models (LLMs) with planning and tool calls. It compares single-agent and multi-agent patterns, catalogs design choices (leadership, memory, message filtering, dynamic teams), and summarizes strengths and failure modes. Key practical points: single agents are simpler and work well for narrowly scoped tool-driven tasks; multi-agent teams help parallelize, provide diverse feedback, and often benefit from a designated leader or dynamic team management. Evaluation gaps and benchmark contamination remain major limits.

Problem Statement

Practitioners need a clear, practical view of modern LLM-powered agent architectures: when to pick single- vs. multi-agent designs, which design elements matter for robust planning and tool use, and what current research says about evaluation gaps and failure modes.

Main Contribution

A compact taxonomy and comparison of single-agent vs. multi-agent architectures and their variants (vertical/horizontal).

A focused checklist of design levers that improve agent performance: leadership, planning phases, role definition, message filtering, dynamic teams, and human feedback.

A synthesis of representative agent patterns (ReAct, RAISE, Reflexion, AutoGPT+P, LATS, AgentVerse, DyLAN, MetaGPT) and their practical trade-offs.

Key Findings

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

Designating a team leader speeds multi-agent task completion.

Numbers: ≈10% faster time-to-completion with a leader

Dynamic team construction and rotating leadership improve efficiency and reduce communication cost.

Inconsistent benchmarks and training-data contamination distort agent evaluation.

Results

hallucination rate

Value: ReAct: 6%; CoT: 14%

Baseline: Chain-of-Thought

time-to-completion

Value: ≈10% faster with a team leader

Baseline: leaderless teams

dataset size

Value: 570,000 chat logs
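
The hallucination figures above can be read as either an absolute or a relative improvement, and the two are easy to confuse. A small worked calculation using only the two reported rates (the variable names are illustrative):

```python
# Reported hallucination rates on HotpotQA (from the results above).
react_rate = 0.06
cot_rate = 0.14

# Absolute gap, in percentage points.
absolute_points = round((cot_rate - react_rate) * 100, 1)

# Relative reduction, as a percentage of the Chain-of-Thought baseline.
relative_pct = round((cot_rate - react_rate) / cot_rate * 100, 1)

print(f"{absolute_points} points absolute, {relative_pct}% relative")
```

So ReAct's rate is 8 percentage points lower, which is a relative reduction of roughly 57% against the Chain-of-Thought baseline.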

Who Should Care

What To Try In 7 Days

Prototype a single-agent flow with a tight persona and a short scratchpad memory.

Run a small multi-agent demo with a designated leader and one specialist agent to test parallelism.

Add a simple message filter so agents only receive task-relevant messages and measure time-to-completion.
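
The message-filter experiment can be prototyped before any LLM is wired in. The following is a minimal sketch, assuming messages carry topic tags and each agent declares the topics its role cares about; `Message`, `Agent`, and `deliver` are hypothetical names for illustration, not part of any framework named in the survey.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: str
    topics: frozenset  # hypothetical tags attached by the sender
    text: str


@dataclass
class Agent:
    name: str
    interests: frozenset  # topics this agent's role cares about
    inbox: list = field(default_factory=list)


def deliver(message, agents):
    """Deliver a message only to agents whose interests overlap its topics."""
    recipients = []
    for agent in agents:
        # Skip the sender, and skip agents with no overlapping topic.
        if agent.name != message.sender and agent.interests & message.topics:
            agent.inbox.append(message)
            recipients.append(agent.name)
    return recipients


agents = [
    Agent("leader", frozenset({"plan", "status"})),
    Agent("coder", frozenset({"plan", "code"})),
    Agent("tester", frozenset({"code", "test"})),
]
msg = Message("leader", frozenset({"plan"}), "Split the task into two modules.")
recipients = deliver(msg, agents)
print(recipients)
```

Here only the coder receives the planning message; the tester's inbox stays empty. Comparing time-to-completion with and without the topic check gives a first measurement for the experiment.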

Agent Features

Memory

  • scratchpad_short_term
  • long_term_dataset_memory
  • sliding_window_context
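
As an illustration of the sliding-window idea, the sketch below keeps only the most recent turns as prompt context. The class name `SlidingWindowMemory` and the turn format are assumptions made for this example; real systems often bound the window by tokens rather than turns.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the most recent `max_turns` messages as prompt context."""

    def __init__(self, max_turns):
        # A deque with maxlen silently drops the oldest item when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, role, text):
        self._turns.append((role, text))

    def context(self):
        return "\n".join(f"{role}: {text}" for role, text in self._turns)


memory = SlidingWindowMemory(max_turns=2)
memory.add("user", "Find the capital of France.")
memory.add("agent", "Calling search tool.")
memory.add("tool", "Paris")
window = memory.context()
print(window)
```

With a window of two turns, the original user request has already been evicted. This is the trade-off of the approach: fewer tokens per call, at the cost of losing early context unless it is also stored in a longer-term memory.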

Planning

  • task_decomposition
  • multi-plan_selection
  • external_module_planning
  • reflection_refinement
  • memory_augmented_planning
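
Multi-plan selection can be sketched as "generate several candidate plans, score each, keep the best". In the example below the candidate plans are hard-coded and `toy_score` is a hypothetical heuristic (reject plans that need unavailable tools, then prefer fewer steps); in practice the candidates would come from the model and the scorer could be a search procedure or a learned evaluator.

```python
def select_plan(candidates, score):
    """Return the highest-scoring plan; ties go to the earliest candidate."""
    return max(candidates, key=score)


def toy_score(plan, available_tools=frozenset({"search", "calculator"})):
    """Hypothetical heuristic: reject plans needing missing tools, then prefer fewer steps."""
    if any(step["tool"] not in available_tools for step in plan):
        return float("-inf")
    return -len(plan)


candidates = [
    [{"tool": "search"}, {"tool": "browser"}],  # needs an unavailable tool
    [{"tool": "search"}, {"tool": "search"}, {"tool": "calculator"}],
    [{"tool": "search"}, {"tool": "calculator"}],  # shortest feasible plan
]
best = select_plan(candidates, toy_score)
print(best)
```

The third plan is selected because it is the shortest one that uses only available tools.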

Tool Use

  • function_calling
  • api_integration
  • robotic_task_planning
  • tool_selection
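
Function calling usually reduces to the model emitting a structured request that the host program validates and dispatches. The sketch below assumes a generic JSON shape, `{"name": ..., "arguments": {...}}`, chosen for illustration; it is not the wire format of any specific vendor API, and the `tool` decorator and `dispatch` function are hypothetical helpers.

```python
import json

TOOLS = {}


def tool(fn):
    """Register a plain Python function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a, b):
    return a + b


@tool
def word_count(text):
    return len(text.split())


def dispatch(call_json):
    """Parse a tool call (assumed JSON shape) and run the matching tool."""
    call = json.loads(call_json)
    name = call.get("name")
    if name not in TOOLS:
        # Report unknown tools back to the agent instead of raising.
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**call.get("arguments", {}))}
    except TypeError as exc:
        # Wrong or missing arguments surface as a structured error.
        return {"ok": False, "error": str(exc)}


good = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
bad = dispatch('{"name": "delete_files", "arguments": {}}')
print(good, bad)
```

Returning errors as data rather than raising lets the agent see the failure and choose a different tool on its next step.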

Frameworks

  • ReAct
  • RAISE
  • Reflexion
  • AutoGPT+P
  • LATS
  • AgentVerse
  • DyLAN
  • MetaGPT

Is Agentic

true

Architectures

  • single-agent
  • multi-agent
  • vertical (leader-based)
  • horizontal (peer-based)
  • dynamic teams

Collaboration

  • leader-based
  • peer-to-peer
  • publish-subscribe
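
A back-of-the-envelope model shows why the collaboration pattern affects communication cost. The sketch below is a deliberate simplification, not a result from the survey: it assumes that in a peer-to-peer round every agent messages every other agent, while in a leader-based round each worker reports to the leader and gets one reply.

```python
def messages_per_round(n_agents, topology):
    """Count point-to-point messages in one round under a simplified model."""
    if n_agents < 2:
        return 0
    if topology == "peer-to-peer":
        # Every agent sends its update to every other agent.
        return n_agents * (n_agents - 1)
    if topology == "leader-based":
        # Each worker reports to the leader; the leader replies to each worker.
        return 2 * (n_agents - 1)
    raise ValueError(f"unknown topology: {topology}")


costs = {
    n: (messages_per_round(n, "leader-based"), messages_per_round(n, "peer-to-peer"))
    for n in (2, 4, 8)
}
print(costs)
```

Under these assumptions the leader-based count grows linearly with team size and the peer-to-peer count grows quadratically (14 vs. 56 messages for eight agents). Publish-subscribe filtering sits between the two, since agents receive only the topics they subscribe to.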

Optimization Features

Token Efficiency

  • sliding_window_memory

System Optimization

  • dynamic_agent_recruitment

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Heterogeneous and often proprietary benchmarks make cross-paper comparison hard.
  • Training data contamination can inflate reported benchmark performance.
  • Many agent evaluations use small or hand-scored datasets prone to bias.
  • Multi-agent chatter and role confusion remain unsolved and task-dependent.

When Not To Use

  • Avoid multi-agent teams for narrowly defined, single-tool workflows where overhead outweighs benefit.
  • Avoid agentic autonomy without human oversight on high-stakes or safety-critical tasks.
  • Avoid relying solely on static public benchmarks to judge agent readiness.

Failure Modes

  • Agents get stuck in repetitive reasoning-action loops and never terminate.
  • Role hallucination: agents attempt actions outside their intended role.
  • Team chatter consumes bandwidth and reduces task focus in horizontal teams.
  • Leader failure: a leader can omit crucial info and break team coordination.
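
The first failure mode, non-terminating reasoning-action loops, is commonly mitigated with a step budget plus repeated-action detection. The following is a minimal sketch under stated assumptions: `step_fn` stands in for one model call that returns an `(action, argument)` tuple, and the thresholds are arbitrary defaults chosen for illustration.

```python
def run_agent(step_fn, max_steps=10, max_repeats=2):
    """Run step_fn until it finishes, hits the step cap, or repeats an action too often."""
    history = []
    for _ in range(max_steps):
        action = step_fn(history)
        if action[0] == "finish":
            return {"status": "done", "answer": action[1], "steps": len(history)}
        # Abort when the identical action has already been taken max_repeats times.
        if history.count(action) >= max_repeats:
            return {"status": "aborted_loop", "answer": None, "steps": len(history)}
        history.append(action)
    return {"status": "step_limit", "answer": None, "steps": len(history)}


def stuck_policy(history):
    # Hypothetical policy that keeps issuing the same search.
    return ("search", "capital of France")


def good_policy(history):
    # Hypothetical policy that searches once, then answers.
    return ("finish", "Paris") if history else ("search", "capital of France")


stuck = run_agent(stuck_policy)
good = run_agent(good_policy)
print(stuck["status"], good["status"])
```

The stuck policy is cut off after two identical actions, while the well-behaved one finishes in a single step. Exact-match detection is crude; it will miss loops in which the agent rephrases the same action, which is why the hard step cap is kept as a backstop.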

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • GPT-4+

Metrics

  • time-to-completion
  • success rate
  • communication cost
  • hallucination rate
  • output similarity to human responses
  • efficiency of tool use

Datasets

  • HotpotQA
  • HumanEval
  • MBPP
  • WildChat/WildBench (570k)
  • AgentBench
  • SmartPlay
  • SWE-bench
  • MMLU
  • GSM8K
  • StrategyQA

Benchmarks

  • AgentBench
  • SmartPlay
  • WildBench
  • HumanEval
  • MBPP
  • SWE-bench