Multi-Agent Systems Papers — Parsed & Scored for Practitioners

MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

0.70

0.60

130

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code that needs fewer manual fixes. The trade-off is higher token cost in exchange for less engineering review time and higher delivery quality.

Key finding

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% (HumanEval) and 87.7% (MBPP)
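
For context on the metric, Pass@1 is the k = 1 case of pass@k, commonly computed with the unbiased estimator popularized by the Codex evaluation. The sketch below assumes n samples were generated per problem and c of them passed the tests; whether this paper used exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, any k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the plain fraction of correct samples, c / n.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level Pass@1 is then the mean of this per-problem value over all problems.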

Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

0.50

0.60

0.40

85

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Key finding

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on the paper's arithmetic test set.

Numbers: Arithmetic: 67.0% → 81.8% (Table 1)
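
The debate idea can be sketched roughly as follows: each copy answers, then revises after seeing its peers' answers, and the final answer is taken by majority. This is a minimal illustration, not the paper's implementation; the `Agent` signature and the `make_stub` helper are hypothetical stand-ins for real LLM calls.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent signature: (question, peer_answers) -> answer.
Agent = Callable[[str, List[str]], str]

def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a fixed number of debate rounds, then return the majority answer."""
    # Round 0: each agent answers independently.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and may revise.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]

def make_stub(initial: str) -> Agent:
    """Stand-in for an LLM: adopts the peers' strict-majority answer if any."""
    def agent(question: str, peers: List[str]) -> str:
        if not peers:
            return initial
        top, count = Counter(peers).most_common(1)[0]
        return top if count > len(peers) / 2 else initial
    return agent

agents = [make_stub("42"), make_stub("42"), make_stub("41")]
result = debate("What is 6 * 7?", agents)
```

With real models, each extra agent and round multiplies the number of calls, which is the latency cost the summary mentions.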

ChatDev: multi-agent LLMs that chat to design, code, and test software

0.40

0.50

0.30

69

ChatDev makes prototyping software faster and more reliable by combining role-based LLM agents into a chained workflow that raises the chance code runs without heavy manual fixes.

Key finding

ChatDev generates more runnable software than baselines.

Numbers: Executability: ChatDev 0.88 vs GPT-Engineer 0.3583, MetaGPT 0.4145

A modular graph framework that lets multiple LLM agents collaborate, create agents, and supervise each other

0.30

0.60

0.40

50

Modular LLM agents let teams split complex workflows, add verifiers to reduce costly errors, and plug in APIs safely — but they add orchestration costs and governance requirements.

Key finding

Agents can be modeled as tuples (L, R, S, C, H) to standardize behavior and permissions.
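
One way to picture the tuple is as a small record type. The field meanings below (language model, role, state, permission to create agents, set of agents it may halt) are an assumed reading of the letters, since the summary lists only the symbols; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set

@dataclass
class Agent:
    # Field meanings are assumptions; the summary gives only the letters.
    language_model: str                                   # L
    role: str                                             # R
    state: Dict[str, Any] = field(default_factory=dict)   # S
    can_create_agents: bool = False                       # C
    can_halt: Set[str] = field(default_factory=set)       # H

    def may_halt(self, other_name: str) -> bool:
        """Permission check: is this agent allowed to stop the named agent?"""
        return other_name in self.can_halt

supervisor = Agent(
    language_model="some-model",  # placeholder, not a real model name
    role="supervisor",
    can_create_agents=True,
    can_halt={"coder"},
)
```

Keeping permissions as explicit fields makes it straightforward to audit which agents can spawn or stop others.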

Practical survey of single- vs. multi-agent designs, planning steps, and tool calling trade-offs

0.60

0.50

19

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Key finding

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

Survey: how LLMs and LLM-based agents reshape software engineering workflows

0.60

0.45

0.60

15

LLM-based agents enable higher automation for multi-step engineering tasks (planning, tool use, testing) and often raise real pass rates and reduce human iteration; single LLMs remain cheaper for isolated code generation or simple analysis.

Key finding

Survey corpus and venue split — the review covers 139 papers and many are preprints.

Numbers: 139 papers; arXiv accounts for 40.3% of papers

AgentSims: a visual, multi-agent sandbox to build task-based LLM benchmarks quickly

0.50

0.60

13

AgentSims helps teams test language models in realistic, multi-step roles (e.g., mayor, employee). That reveals operational gaps not visible with static benchmarks and speeds prototyping for productized agents.

Key finding

Task-based evaluation reduces hackability, broadens tested abilities, and yields an objective pass rate.

Use LLM agents and a fishbowl discussion to simulate participatory urban planning and improve resident satisfaction

0.30

0.60

0.50

11

Simulated multi-agent LLM planning can surface local needs early, reducing time and rehearsal costs before engaging humans; it helps test many “what-if” land-use options quickly while keeping service coverage competitive.

Key finding

Simulated participatory planning raised resident Satisfaction to 0.787 on HLG.

Numbers: Satisfaction 0.787 (HLG) vs 0.708 (best baseline DRL)

Use small LLM agents to filter and block jailbreak responses from larger models

0.60

0.70

11

AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.

Key finding

Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.

Numbers: ASR 55.74% → 7.95% (DAN, GPT-3.5 victim)

A manager–analyst LLM multi-agent system that uses verbalized, episode-level belief updates (CVRF) plus daily CVaR alerts to improve trading and …

0.45

0.60

0.50

11

FINCON shows that structuring LLMs like a small investment team plus two-tiered risk controls can raise backtested returns and Sharpe ratios while reducing unnecessary inter-agent communication. This suggests a practical path for building LLM-based decision pipelines for small active portfolios and research prototypes.

Key finding

FINCON produces much higher cumulative returns on tested stocks than baselines.

Numbers: TSLA CR 82.871% vs buy-and-hold 6.425% (Table 2)
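
To make the daily CVaR alert concrete, here is a minimal sketch of a historical CVaR check: average the worst share of recent daily returns and raise an alert when that average falls below a limit. The tail fraction, threshold, sample returns, and function names are illustrative assumptions, not FINCON's actual settings or code.

```python
import math
from typing import Sequence

def historical_cvar(returns: Sequence[float], tail_fraction: float = 0.05) -> float:
    """Mean of the worst `tail_fraction` share of returns (negative = loss)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    ordered = sorted(returns)
    # Round before ceil to avoid floating-point noise such as 1.0000000001.
    tail_size = max(1, math.ceil(round(len(ordered) * tail_fraction, 9)))
    worst = ordered[:tail_size]
    return sum(worst) / len(worst)

def cvar_alert(returns: Sequence[float], tail_fraction: float = 0.05,
               threshold: float = -0.03) -> bool:
    """True when the tail loss is worse than the (assumed) threshold."""
    return historical_cvar(returns, tail_fraction) < threshold

# Made-up daily returns for illustration only.
daily_returns = [-0.05, -0.04, 0.01, 0.02, 0.0, 0.01, -0.01, 0.03, 0.02, 0.01]
cvar = historical_cvar(daily_returns, tail_fraction=0.2)
alert = cvar_alert(daily_returns, tail_fraction=0.2)
```

Here the worst two of ten days average about -4.5%, which is below the -3% limit, so the alert fires.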

PIANO: a concurrent, bottlenecked agent brain that scales to 10–1000+ agents and yields specialization, laws, and cultural spread in sandbox

0.20

0.70

0.60

10

PIANO shows how modular, concurrent agent brains plus a small coordination bottleneck produce coherent multi-stream behavior at scale. This matters for products that require many autonomous agents to self-organize, coordinate, or influence user communities, e.g., simulation platforms, game NPCs, synthetic user testing, …

Key finding

Single-agent item progression: agents with full PIANO acquired on average 17 unique Minecraft items after 30 minutes.

Numbers: avg 17 unique items / agent @ 30 min (Figure 5A)

AgentLite: tiny open-source toolkit to rapidly prototype task-oriented and multi-agent LLM systems

0.60

0.40

0.50

10

AgentLite reduces code overhead for prototyping LLM agents so engineering teams can test agent ideas quickly without a heavy framework or large code refactor.

Key finding

AgentLite is small and focused: core codebase is under 1,000 lines.

Numbers: AgentLite core lines = 959; LangChain = 248,650 (Table 1)

Survey and roadmap for LLM-based multi-agent systems applied to software engineering

0.40

0.60

0.65

8

Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time; but scale and correctness limits mean human oversight is still required for complex or safety-critical work.

Key finding

The survey covers 71 recent primary studies on LLM-based multi-agent (LMA) systems in software engineering.

Numbers: 71 primary studies (41 identified then +30 via snowballing)

GAMABench: a dynamic multi‑agent game benchmark that reveals LLMs' weak generalization and prompt sensitivity

0.60

0.40

8

GAMABench exposes how LLMs handle multi‑agent strategic choices, revealing leaderboard gaps and prompt sensitivities that affect any product using LLMs for negotiation, coordination, or recommendation.

Key finding

Closed- and open-source leaderboard: Gemini-1.5-Pro performs best.

Numbers: Gemini-1.5-Pro: 69.8 /100; LLaMA-3.1-70B: 65.9; Mixtral-8x22B: 62.4

Domain-specific AI agents collaborate to find cross-domain knowledge

0.30

0.50

0.40

7

Orchestrated domain-specific agents can raise answer accuracy for cross-field queries, trading speed for higher-quality, context-aware results.

Key finding

Agents were seeded with domain literature to create domain-specific expertise.

Numbers: ≈1000 papers per agent (Section 2.1)

A compact map of context-aware multi-agent systems and the five capabilities agents need to work reliably in dynamic settings

0.30

0.40

0.30

6

Context-aware multi-agent design increases robustness and scalability for distributed automation, but requires upfront choices on organization, communication and privacy to avoid noisy or insecure data sharing.

Key finding

CA-MAS design revolves around five agent phases: Sense, Learn, Reason, Predict, Act.

Numbers: 5 phases named explicitly in Section 4.2
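
The five phases can be read as one pass of an agent loop. The sketch below wires them together as plain functions over a shared context dictionary; the thermostat-style scenario, the naive trend rule, and all names are illustrative assumptions rather than anything prescribed by the survey.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]

def sense(ctx: Context) -> Context:
    # Read the environment through a (hypothetical) sensor callable.
    return {**ctx, "reading": ctx["sensor"]()}

def learn(ctx: Context) -> Context:
    # Store the new observation in the agent's history.
    return {**ctx, "history": ctx["history"] + [ctx["reading"]]}

def reason(ctx: Context) -> Context:
    # Summarize what is known so far: here, a simple running average.
    history = ctx["history"]
    return {**ctx, "average": sum(history) / len(history)}

def predict(ctx: Context) -> Context:
    # Naive trend: assume the deviation from the average continues.
    predicted = ctx["reading"] + (ctx["reading"] - ctx["average"])
    return {**ctx, "predicted": predicted}

def act(ctx: Context) -> Context:
    action = "cool" if ctx["predicted"] > ctx["target"] else "idle"
    return {**ctx, "action": action}

PHASES: List[Callable[[Context], Context]] = [sense, learn, reason, predict, act]

def step(ctx: Context) -> Context:
    """Run Sense, Learn, Reason, Predict, Act once, in that order."""
    for phase in PHASES:
        ctx = phase(ctx)
    return ctx

ctx = step({"sensor": lambda: 25.0, "history": [20.0, 21.0], "target": 24.0})
```

In a multi-agent setting, the context would also carry messages from other agents, which is where the organization and privacy choices come in.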

How LLMs are being used to build game-playing agents: memory, reasoning, perception, and multi-agent design

0.40

0.60

0.50

6

Game agents are a practical lab for building interactive AI: solutions for memory, robust reasoning, and hybrid control transfer to real automation, simulations, and multi-agent coordination systems used in product testing and virtual worlds.

Key finding

Carrying the previous step's thought into the next prompt (LastThoughts) raises win rate and cuts short-term inconsistent actions.

Numbers: Win rate 0.4217 → 0.4667; consecutive switch rate 0.2442 → 0.0861
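
A minimal sketch of the LastThoughts idea: the thought produced at one step is inserted into the prompt for the next step. The prompt wording and the `think` callable (a stand-in for an LLM call that returns a thought and an action) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def build_prompt(observation: str, last_thought: Optional[str] = None) -> str:
    """Compose the step prompt, carrying over the previous thought if any."""
    parts = []
    if last_thought:
        parts.append(f"Your thought on the previous step: {last_thought}")
    parts.append(f"Current observation: {observation}")
    parts.append("Think, then choose the next action.")
    return "\n".join(parts)

def play(observations: List[str],
         think: Callable[[str], Tuple[str, str]]) -> List[str]:
    """Run one episode; `think` maps a prompt to (thought, action)."""
    last_thought: Optional[str] = None
    actions = []
    for obs in observations:
        thought, action = think(build_prompt(obs, last_thought))
        actions.append(action)
        last_thought = thought  # carried into the next prompt
    return actions
```

Because each decision is conditioned on the prior rationale, the agent is less likely to flip strategy between consecutive steps, which matches the lower switch rate reported above.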

VillagerAgent: use DAGs to coordinate LLM agents and a new VillagerBench in Minecraft

0.40

0.60

0.55

6

Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.

Key finding

VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks.

Numbers: Failure rate 18.2% (VillagerAgent) vs 44.4% (AgentVerse)
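
To illustrate what modeling task dependencies with a DAG buys you, the sketch below uses Python's standard `graphlib` to group tasks into waves that can be handed to agents in parallel. The task names and the `schedule_in_waves` helper are made up for illustration and are not VillagerAgent's API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical cooking task: each key depends on the tasks in its set.
deps: Dict[str, Set[str]] = {
    "gather_wheat": set(),
    "gather_wood": set(),
    "craft_table": {"gather_wood"},
    "bake_bread": {"gather_wheat", "craft_table"},
}

def schedule_in_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unlock successors
    return waves

waves = schedule_in_waves(deps)
```

An agent is only assigned a task once its prerequisites are done, which removes one class of coordination error before any LLM call is made.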

Generate editable BIM models from plain language by orchestrating LLM agents that write modeling code

0.60

6

Text2BIM lets designers describe early-stage buildings in plain language and get editable BIM models, reducing manual modeling effort and speeding concept-to-BIM workflows while preserving the ability to refine results in standard BIM tools.

Key finding

The framework produced editable IFC/BIM models for 25 diverse prompts with 534 generated runs.

Numbers: 534 IFC models generated (25 prompts × 3 LLMs × 3 repeats incl. intermediate runs)

Make two LLMs argue, judge their claims, and tune debate tone to reduce bias and hallucination

0.40

0.60

0.50

6

SocraSynth turns LLM outputs from single-shot answers into cross-checked, debate-driven recommendations, which reduces obvious bias and yields richer, testable proposals—useful for policy, diagnostics, and decision support.

Key finding

Multi-agent debates scored higher than single-model Q&A on judged information quality.

Numbers: GPT-4 judge totals A=39 vs B=32 (Table 5); role-swapped totals remain comparable (Table 6)

Survey of how LLMs reason strategically in multi-agent games, economics, and social simulations

0.40

0.30

6

LLM-driven agents can model multi-party dynamics (negotiations, markets, simulations) and improve decision-making, but measurement and domain alignment matter more than raw model size.

Key finding

LLM strategic work spans four scenario families: societal, economic, game-theory, and gaming.

Numbers: 4 scenario categories

Make LLMs more creative by running multi‑round role‑played discussions instead of single prompts

0.60

0.65

0.35

6

A structured multi‑agent, role‑played discussion can produce noticeably more original and detailed ideas than single prompts, useful for ideation, product concepts, and creative marketing at modest engineering cost.

Key finding

LLM Discussion increases originality on the Alternative Uses Test (AUT) compared to the single-agent baseline.

Numbers: Originality mean 4.44 vs 3.47 (LLM eval, AUT, Table 2)

MACNET: use directed acyclic graphs to scale LLM agents and show a logistic ‘collaborative scaling law’

0.60

0.80

0.60

6

Running many cooperating LLM agents in a DAG can improve quality on mixed tasks without expensive retraining; randomized wiring often gives a good speed-quality trade-off.

Key finding

MACNET variants outperform multi-agent and single-agent baselines on average across diverse tasks.

Numbers: Quality: MACNET-RANDOM 0.6522 vs AGENTVERSE 0.5805 (Table 1).
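
For readers unfamiliar with the term, a logistic curve has the general shape below. The symbols (q for solution quality, n for a measure of agent scale, and the constants q_max, k, n_0) are placeholders chosen here, not the paper's notation or fitted values.

```latex
q(n) = \frac{q_{\max}}{1 + e^{-k\,(n - n_0)}}
```

Under this shape, quality rises slowly at small scale, fastest around n_0, and then saturates near q_max, so adding agents beyond that point yields diminishing returns.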

Use attention-equipped diffusion models to learn coordinated multi-agent policies and predict joint trajectories from offline logs

0.60

0.70

0.50

5

MADiff can learn coordinated policies and reliable joint trajectory predictions from logs, enabling product features where online trials are costly or unsafe; it's best for small teams and stable environments.

Key finding

MADiff greatly improves multi-agent trajectory prediction on the NBA dataset.

Numbers: ADE 7.92 ± 0.86 vs 15.15 ± 0.38 (Baller2Vec++), traj len 20
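
ADE (average displacement error) is the mean Euclidean distance between predicted and true positions across time steps; lower is better. The sketch below computes it for a single trajectory, with made-up coordinates and an illustrative function name. For multiple agents, the same value would typically be averaged over agents as well.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def average_displacement_error(pred: Sequence[Point],
                               truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching predicted and true points."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equal in length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

# Two-step toy trajectory: errors are 0 and 5, so the ADE is 2.5.
ade = average_displacement_error(
    pred=[(0.0, 0.0), (3.0, 4.0)],
    truth=[(0.0, 0.0), (0.0, 0.0)],
)
```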