Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.5
Citation Count
19
Why It Matters For Business
Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.
Summary TLDR
This short survey maps current AI agent designs that combine large language models (LLMs) with planning and tool calls. It compares single-agent and multi-agent patterns, catalogs design choices (leadership, memory, message filtering, dynamic teams), and summarizes strengths and failure modes. Key practical points: single agents are simpler and work well for narrowly scoped, tool-driven tasks; multi-agent teams help parallelize work, provide diverse feedback, and often benefit from a designated leader or dynamic team management. Evaluation gaps and benchmark contamination remain major limitations.
Problem Statement
Practitioners need a clear, practical view of modern LLM-powered agent architectures: when to pick single vs multi-agent, which design elements matter for robust planning and tool use, and what current research says about evaluation gaps and failure modes.
Main Contribution
A compact taxonomy and comparison of single-agent vs. multi-agent architectures and their variants (vertical/horizontal).
A focused checklist of design levers that improve agent performance: leadership, planning phases, role definition, message filtering, dynamic teams, and human feedback.
A synthesis of representative agent patterns (ReAct, RAISE, Reflexion, AutoGPT+P, LATS, AgentVerse, DyLAN, MetaGPT) and their practical trade-offs.
Key Findings
ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.
Designating a team leader speeds multi-agent task completion.
Dynamic team construction and rotating leadership improve efficiency and reduce communication cost.
Benchmarks and training data contamination distort agent evaluation.
Results
hallucination rate
time-to-completion
dataset size
What To Try In 7 Days
Prototype a single-agent flow with a tight persona and a short scratchpad memory.
Run a small multi-agent demo with a designated leader and one specialist agent to test parallelism.
Add a simple message filter so agents only receive task-relevant messages and measure time-to-completion.
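The message-filter experiment can be prototyped without any LLM calls. The sketch below is a minimal illustration, assuming a topic-based subscription scheme; the `Message`, `Agent`, and `deliver` names and the topic labels are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Message:
    sender: str
    topic: str
    body: str


@dataclass
class Agent:
    name: str
    topics: Set[str]  # topics this agent wants to receive (assumed scheme)
    inbox: List[Message] = field(default_factory=list)


def deliver(message: Message, agents: List[Agent]) -> int:
    """Deliver a message only to subscribed agents other than the sender.

    Returns the number of inboxes the message reached.
    """
    delivered = 0
    for agent in agents:
        if agent.name != message.sender and message.topic in agent.topics:
            agent.inbox.append(message)
            delivered += 1
    return delivered


leader = Agent("leader", {"plan", "result"})
coder = Agent("coder", {"plan", "code"})
reviewer = Agent("reviewer", {"code"})
team = [leader, coder, reviewer]

total = 0
total += deliver(Message("leader", "plan", "Implement the parser"), team)
total += deliver(Message("coder", "code", "def parse(): ..."), team)
total += deliver(Message("coder", "result", "Parser done"), team)
```

Here three messages produce three deliveries, where an unfiltered broadcast to every other team member would produce six. To measure time-to-completion, the same task can be run with and without the filter and timed with `time.perf_counter()`.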
Agent Features
Memory
- scratchpad_short_term
- long_term_dataset_memory
- sliding_window_context
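As a rough illustration of sliding-window context, the sketch below keeps only the most recent turns and renders them into a prompt string. It assumes the window is counted in turns rather than tokens, and the `SlidingWindowMemory` class is a hypothetical helper, not something defined in the survey.

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps the last `max_turns` conversation turns; older turns are dropped."""

    def __init__(self, max_turns: int):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def render(self) -> str:
        """Build the context string that would be sent to the model."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)


memory = SlidingWindowMemory(max_turns=3)
for i in range(1, 6):
    memory.add("user", f"message {i}")

context = memory.render()
```

After five turns with a window of three, `context` holds only messages 3 through 5. A token-based budget would need a tokenizer for the target model and is left out of this sketch.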
Planning
- task_decomposition
- multi-plan_selection
- external_module_planning
- reflection_refinement
- memory_augmented_planning
Tool Use
- function_calling
- api_integration
- robotic_task_planning
- tool_selection
Frameworks
- ReAct
- RAISE
- Reflexion
- AutoGPT+P
- LATS
- AgentVerse
- DyLAN
- MetaGPT
Is Agentic
true
Architectures
- single-agent
- multi-agent
- vertical (leader-based)
- horizontal (peer-based)
- dynamic teams
Collaboration
- leader-based
- peer-to-peer
- publish-subscribe
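The leader-based (vertical) pattern can be sketched as a leader that hands subtasks to specialists and collects their outputs. The example below is a minimal sketch under stated assumptions: the specialists are plain Python functions standing in for LLM-backed agents, and `leader_run`, `researcher`, and `writer` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Specialist = Callable[[str], str]


def researcher(subtask: str) -> str:
    # Stand-in for an LLM-backed research agent.
    return f"notes on {subtask}"


def writer(subtask: str) -> str:
    # Stand-in for an LLM-backed writing agent.
    return f"draft of {subtask}"


def leader_run(assignments: Dict[str, str],
               specialists: Dict[str, Specialist]) -> List[str]:
    """Dispatch each subtask to its named specialist in parallel.

    Results are returned in the order the assignments were given.
    """
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(specialists[role], subtask)
            for role, subtask in assignments.items()
        ]
        return [future.result() for future in futures]


results = leader_run(
    {"researcher": "agent benchmarks", "writer": "summary section"},
    {"researcher": researcher, "writer": writer},
)
```

In a peer-to-peer (horizontal) variant there would be no single dispatcher; each agent would read and post to a shared channel instead.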
Optimization Features
Token Efficiency
- sliding_window_memory
System Optimization
- dynamic_agent_recruitment
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Heterogeneous and often proprietary benchmarks make cross-paper comparison hard.
- Training data contamination can inflate reported benchmark performance.
- Many agent evaluations use small or hand-scored datasets prone to bias.
- Multi-agent chatter and role confusion remain unsolved, and their severity depends on the task.
When Not To Use
- Avoid multi-agent teams for narrowly defined, single-tool workflows where overhead outweighs benefit.
- Avoid agentic autonomy without human oversight on high-stakes or safety-critical tasks.
- Avoid relying solely on static public benchmarks to judge agent readiness.
Failure Modes
- Agents get stuck in repetitive reasoning-action loops and never terminate.
- Role hallucination: agents perform capabilities outside their intended role.
- Team chatter consumes bandwidth and reduces task focus in horizontal teams.
- Leader failure: a leader can omit crucial info and break team coordination.
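The first failure mode, non-terminating reasoning-action loops, is commonly mitigated with a step budget and a repetition check. The sketch below is one possible guard, not a method taken from the survey; `run_with_guards`, its thresholds, and the toy agents are all hypothetical.

```python
from typing import Callable, List, Tuple

# A step function receives the action history and returns (action, done).
StepFn = Callable[[List[str]], Tuple[str, bool]]


def run_with_guards(step: StepFn, max_steps: int = 10,
                    max_repeats: int = 2) -> Tuple[List[str], str]:
    """Run an agent loop, stopping on completion, repetition, or step budget."""
    history: List[str] = []
    for _ in range(max_steps):
        action, done = step(history)
        history.append(action)
        if done:
            return history, "finished"
        # Stop if the same action has now occurred max_repeats + 1 times in a row.
        tail = history[-(max_repeats + 1):]
        if len(tail) == max_repeats + 1 and len(set(tail)) == 1:
            return history, "stopped: repeated action"
    return history, "stopped: step budget exhausted"


def stuck_agent(history: List[str]) -> Tuple[str, bool]:
    # Toy agent that always issues the same action and never finishes.
    return "search('capital of France')", False


def finishing_agent(history: List[str]) -> Tuple[str, bool]:
    # Toy agent that finishes on its third step.
    return f"step {len(history)}", len(history) >= 2


stuck_history, stuck_status = run_with_guards(stuck_agent)
ok_history, ok_status = run_with_guards(finishing_agent)
```

With the default thresholds, the stuck agent is halted after three identical actions, while the finishing agent completes normally. Exact string matching is a crude repetition test; near-duplicate actions would need a looser comparison.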
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
- GPT-4+
Metrics
- time-to-completion
- success rate
- communication cost
- hallucination rate
- output similarity to human responses
- efficiency of tool use
Datasets
- HotpotQA
- HumanEval
- MBPP
- WildChat/WildBench (570k)
- AgentBench
- SmartPlay
- SWE-bench
- MMLU
- GSM8K
- StrategyQA
Benchmarks
- AgentBench
- SmartPlay
- WildBench
- HumanEval
- MBPP
- SWE-bench

