Overview
The survey summarizes many prototype systems and the empirical evidence for what works, but its findings draw on heterogeneous papers that use different benchmarks, some with contamination concerns. Apply the designs conservatively and validate them with your own tests.
Citations: 19
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 2/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/3
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.
Who Should Care
Summary TLDR
This short survey maps current AI agent designs that combine large language models (LLMs) with planning and tool calls. It compares single-agent and multi-agent patterns, catalogs design choices (leadership, memory, message filtering, dynamic teams), and summarizes strengths and failure modes. Key practical points: single agents are simpler and work well for narrowly scoped tool-driven tasks; multi-agent teams help parallelize, provide diverse feedback, and often benefit from a designated leader or dynamic team management. Evaluation gaps and benchmark contamination remain major limits.
Problem Statement
Practitioners need a clear, practical view of modern LLM-powered agent architectures: when to pick single vs multi-agent, which design elements matter for robust planning and tool use, and what current research says about evaluation gaps and failure modes.
Main Contribution
A compact taxonomy and comparison of single-agent vs. multi-agent architectures and their variants (vertical/horizontal).
A focused checklist of design levers that improve agent performance: leadership, planning phases, role definition, message filtering, dynamic teams, and human feedback.
Key Findings
ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA (6% vs. 14%).
Designating a team leader speeds multi-agent task completion by roughly 10% compared with leaderless teams.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| hallucination rate | ReAct: 6%; CoT: 14% | Chain-of-Thought | −8 percentage points vs CoT | HotpotQA | ReAct evaluation on HotpotQA | [29,32] |
| time-to-completion | ≈10% faster with a team leader | leaderless teams | ≈−10% time | Embodied LLM team experiments | Leader improves coordination and reduces wasted chat | [9] |
What To Try In 7 Days
Prototype a single-agent flow with a tight persona and a short scratchpad memory.
Run a small multi-agent demo with a designated leader and one specialist agent to test parallelism.
Add a simple message filter so agents only receive task-relevant messages and measure time-to-completion.
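The message-filter experiment above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's method: the `Message` and `Agent` classes, the `topic` tag, and the `route` and `timed` helpers are all hypothetical names chosen for the example, and a real system would likely derive relevance from role prompts or a classifier rather than a fixed topic set.

```python
from dataclasses import dataclass, field
import time


@dataclass
class Message:
    sender: str
    topic: str  # hypothetical tag naming the subtask the message concerns
    body: str


@dataclass
class Agent:
    name: str
    topics: set  # topics this agent's role cares about (an assumption)
    inbox: list = field(default_factory=list)


def route(messages, agents):
    """Deliver each message only to agents subscribed to its topic.

    Returns the number of deliveries that were filtered out.
    """
    dropped = 0
    for msg in messages:
        for agent in agents:
            if agent.name == msg.sender:
                continue  # never echo a message back to its sender
            if msg.topic in agent.topics:
                agent.inbox.append(msg)
            else:
                dropped += 1
    return dropped


def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# One leader that sees everything, one specialist that only sees plan/code traffic.
leader = Agent("leader", {"plan", "search", "code"})
coder = Agent("coder", {"plan", "code"})
messages = [
    Message("leader", "plan", "Split the task into search and code."),
    Message("leader", "search", "Find the API docs."),
    Message("coder", "code", "Draft is ready for review."),
]
dropped, elapsed = timed(route, messages, [leader, coder])
```

To compare filtered and unfiltered runs, wrap the whole task loop in `timed` rather than the routing step alone, since the savings would come from agents reading and answering fewer messages.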
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Reproducibility
Risks & Boundaries
Limitations
Heterogeneous and often proprietary benchmarks make cross-paper comparison hard.
Training data contamination can inflate reported benchmark performance.
When Not To Use
Avoid multi-agent teams for narrowly defined, single-tool workflows where overhead outweighs benefit.
Avoid agentic autonomy without human oversight on high-stakes or safety-critical tasks.
Failure Modes
Agents get stuck in repetitive reasoning-action loops and never terminate.
Role hallucination: agents perform capabilities outside their intended role.
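One common guard against the first failure mode above is a step budget combined with repeated-step detection. The sketch below is an assumption-laden illustration rather than anything the survey prescribes: `run_with_guards`, `step_fn`, and the thresholds are hypothetical, and `stuck_agent` stands in for a real LLM call.

```python
from collections import Counter


def run_with_guards(step_fn, max_steps=20, max_repeats=3):
    """Call step_fn until it returns None, a step repeats too often,
    or the step budget runs out.

    step_fn is a hypothetical callable that takes the history so far and
    returns an (action, argument) tuple, or None when the agent is done.
    Returns (history, stop_reason).
    """
    seen = Counter()
    history = []
    for _ in range(max_steps):
        step = step_fn(history)
        if step is None:
            return history, "finished"
        seen[step] += 1
        if seen[step] >= max_repeats:
            # The same action with the same argument keeps coming back: stop.
            return history, "repeated_step"
        history.append(step)
    return history, "step_budget_exhausted"


def stuck_agent(history):
    # Stand-in for an LLM call: always proposes the same search.
    return ("search", "capital of France")


history, reason = run_with_guards(stuck_agent)
```

Exact-match detection only catches verbatim repeats; an agent that rephrases the same query each turn would still hit the step budget, which is why both guards are kept.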

