Overview
The paper runs broad experiments across many tasks and models and offers practical mitigations. Apply the methods to your own workloads and measure the token and latency tradeoffs before changing infrastructure.
Citations: 2
Evidence Strength: 0.75
Confidence: 0.90
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Multi-agent systems (MAS) raise accuracy only on very hard tasks while multiplying deployment cost. Hybrid routing or a cascade with a single-agent system (SAS) saves API and latency costs while keeping or improving accuracy.
Summary TLDR
This paper compares multi-agent systems (MAS) to single-agent systems (SAS) across 15 agentic tasks (code, math, planning, RAG). As LLM capabilities improve, MAS accuracy gains shrink while token and runtime costs remain much higher (MAS uses 4–220× more prefill tokens on tested tasks). The authors propose (1) a confidence-guided tracing method to find the critical agent to upgrade, and (2) hybrid routing and cascade strategies that send easy requests to SAS and hard ones to MAS. The hybrid designs raise accuracy by ~1.1–12% and can cut deployment cost substantially (authors report up to 88.1% in best cases).
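The routing idea above (easy requests to SAS, hard ones to MAS) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `HybridRouter`, `estimate_difficulty`, the `0.5` threshold, and the toy word-count heuristic are all assumptions introduced here, and the two backends are stand-in callables rather than real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HybridRouter:
    """Send a request to SAS or MAS based on an estimated difficulty score."""

    estimate_difficulty: Callable[[str], float]  # hypothetical scorer returning a value in [0, 1]
    run_sas: Callable[[str], str]                # stand-in for the single-agent pipeline
    run_mas: Callable[[str], str]                # stand-in for the multi-agent pipeline
    threshold: float = 0.5                       # assumed cut-off; tune on held-out data

    def __call__(self, request: str) -> Tuple[str, str]:
        # Hard requests go to the expensive multi-agent path, the rest stay on SAS.
        if self.estimate_difficulty(request) >= self.threshold:
            return "MAS", self.run_mas(request)
        return "SAS", self.run_sas(request)


def toy_difficulty(request: str) -> float:
    # Placeholder heuristic (an assumption): longer requests count as harder.
    return min(len(request.split()) / 20.0, 1.0)


router = HybridRouter(
    estimate_difficulty=toy_difficulty,
    run_sas=lambda r: f"sas-answer({r})",
    run_mas=lambda r: f"mas-answer({r})",
)

easy_route, _ = router("add two numbers")
hard_route, _ = router(" ".join(["step"] * 15))
```

In practice the difficulty scorer is the part that needs the most care; a word-count proxy is only there to make the sketch runnable.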
Problem Statement
MAS split problems across role-specific LLM agents and often improve accuracy, but they cost more and are complex to deploy. With stronger base LLMs, the paper asks when MAS still helps, why MAS can fail, and how to combine MAS and SAS for better cost/accuracy tradeoffs.
Main Contribution
Large empirical comparison of MAS vs SAS across 15 datasets and 7 tasks using multiple frameworks and LLMs.
A graph-based defect taxonomy (node-, edge-, path-level) explaining why MAS underperforms or wastes cost.
Key Findings
MAS accuracy advantage shrinks as underlying LLMs improve.
MAS consumes far more tokens than SAS, raising runtime and monetary cost.
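To check the token finding on your own workload, a small helper like the one below may be enough. It is a sketch under stated assumptions: the function names, the per-1,000-token pricing model, and all numbers in the usage lines are hypothetical and do not come from the paper.

```python
def prefill_ratio(mas_prefill_tokens: int, sas_prefill_tokens: int) -> float:
    """Return how many times more prefill tokens MAS used than SAS."""
    if sas_prefill_tokens <= 0:
        raise ValueError("sas_prefill_tokens must be positive")
    return mas_prefill_tokens / sas_prefill_tokens


def request_cost(
    prefill_tokens: int,
    decode_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Estimate the monetary cost of one request from its token counts.

    Assumes simple linear pricing per 1,000 input and output tokens.
    """
    return (
        prefill_tokens / 1000 * price_in_per_1k
        + decode_tokens / 1000 * price_out_per_1k
    )


# Hypothetical measurements for a single task, for illustration only.
ratio = prefill_ratio(mas_prefill_tokens=8000, sas_prefill_tokens=2000)
sas_cost = request_cost(2000, 500, price_in_per_1k=0.5, price_out_per_1k=1.5)
mas_cost = request_cost(8000, 500, price_in_per_1k=0.5, price_out_per_1k=1.5)
```

Logging these per request, alongside latency, gives the MAS/SAS ratio for your own tasks rather than relying on the 4–220× range reported for the paper's datasets.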
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | SAS 67.0% vs MAS 87.7% (ChatGPT baseline); with Gemini-2.0-Flash SAS 90.2% vs MAS 93.2% | original ChatGPT in prior work | improvement drops from +10.7% to +3.0% | MetaGPT / HumanEval | Table 2 (comparison between original ChatGPT results and Gemini-2.0-Flash runs) | Table 2 |
| Token cost (prefill) MAS vs SAS | MAS uses 4–220× more prefill tokens across datasets (examples in Table 3) | SAS token usage | MAS/SAS ratio per dataset | MBPP, HumanEval, GSM8K, AIME, etc. | Table 3 (token counts per dataset) | Table 3 |
What To Try In 7 Days
Benchmark your current SAS baseline vs MAS using your target LLM; measure token counts and latency.
Run the paper's confidence-guided tracing on one MAS workflow to find the critical agent to upgrade.
Implement an SAS-first cascade with a cheap verifier (exact-match or unit tests) and measure cost vs accuracy.
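The SAS-first cascade in the last step can be prototyped along these lines. This is a hedged sketch, not the paper's code: `sas_first_cascade`, `CascadeResult`, `escalation_rate`, and the toy backends are names invented for illustration, and the exact-match verifier assumes reference answers are available, as they would be in a benchmark run.

```python
from typing import Callable, List, NamedTuple


class CascadeResult(NamedTuple):
    answer: str
    used_mas: bool  # True when the request was escalated to MAS


def sas_first_cascade(
    request: str,
    run_sas: Callable[[str], str],
    run_mas: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> CascadeResult:
    """Try SAS first; escalate to MAS only if the cheap verifier rejects the draft."""
    draft = run_sas(request)
    if verify(request, draft):
        return CascadeResult(draft, used_mas=False)
    return CascadeResult(run_mas(request), used_mas=True)


def escalation_rate(results: List[CascadeResult]) -> float:
    """Fraction of requests that needed the expensive MAS path."""
    return sum(r.used_mas for r in results) / len(results)


# Toy stand-ins so the sketch runs without an LLM backend (all hypothetical).
references = {"2+2": "4", "17*23": "391"}
toy_sas = lambda q: "4" if q == "2+2" else "unknown"
toy_mas = lambda q: references[q]
exact_match = lambda q, a: references.get(q) == a

results = [sas_first_cascade(q, toy_sas, toy_mas, exact_match) for q in references]
rate = escalation_rate(results)
```

For code tasks, the `verify` callable could instead run unit tests against the draft. The escalation rate, combined with per-path token counts, gives the cost side of the cost vs accuracy comparison.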
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Experiments focus on general-purpose LLMs, not fine-tuned agent-specialized models.
Agent frameworks and prompts were adapted; exact reproduction may require the authors' code and prompts.
When Not To Use
Do not use agent cascade when outputs cannot be cheaply and automatically verified.
Avoid converting SAS to MAS blindly for well-scoped, simple tasks; SAS may be cheaper and just as accurate.
Failure Modes
Node-level failure: a weak critical agent caps MAS performance.
Edge-level failure: downstream agents get overwhelmed by upstream messages (overthinking).

