Multi-agent Coordination Papers — Parsed & Scored for Practitioners

MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

0.70

0.60

130

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code that needs fewer manual fixes. The trade-off is higher token cost in exchange for less engineering review time and higher delivery quality.

Key finding

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% (HumanEval) and 87.7% (MBPP)
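
For readers unfamiliar with the metric: Pass@1 is the share of problems solved by a single sampled program. The sketch below uses the widely used unbiased pass@k estimator from code-generation evaluation. It is not taken from MetaGPT itself, and the function name and the per-problem counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k draws include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results per problem: (samples generated, samples passing).
results = [(10, 10), (10, 5), (10, 0)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")  # pass@1 = 0.500
```

With k = 1 the estimator reduces to the plain pass rate c / n, averaged over problems.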

Use LLM agents and a fishbowl discussion to simulate participatory urban planning and improve resident satisfaction

0.30

0.60

0.50

11

Simulated multi-agent LLM planning can surface local needs early, reducing time and rehearsal costs before engaging humans; it helps test many “what-if” land-use options quickly while keeping service coverage competitive.

Key finding

Simulated participatory planning raised resident Satisfaction to 0.787 on HLG.

Numbers: Satisfaction 0.787 (HLG) vs 0.708 (best baseline DRL)

A manager–analyst LLM multi-agent that uses verbalized, episode-level belief updates (CVRF) plus daily CVaR alerts to improve trading and …

0.45

0.60

0.50

11

FINCON shows that structuring LLMs like a small investment team plus two-tiered risk controls can raise backtested returns and Sharpe ratios while reducing chatter. This suggests a practical path for building LLM-based decision pipelines for small active portfolios and research prototypes.

Key finding

FINCON produces much higher cumulative returns on tested stocks than baselines.

Numbers: TSLA CR 82.871% vs buy-and-hold 6.425% (Table 2)
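
To make the "daily CVaR alerts" idea concrete, here is a minimal sketch of a historical CVaR check. How FINCON computes or thresholds CVaR is not stated in this summary, so the tail level, the alert floor, the function names, and the return series are all assumptions for illustration.

```python
import math


def historical_cvar(returns: list[float], tail: float = 0.05) -> float:
    """Mean of the worst `tail` fraction of returns (more negative = riskier)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < tail <= 1.0:
        raise ValueError("tail must be in (0, 1]")
    ordered = sorted(returns)  # worst returns first
    k = max(1, math.ceil(tail * len(ordered)))
    return sum(ordered[:k]) / k


def cvar_alert(returns: list[float], tail: float = 0.05, floor: float = -0.04) -> bool:
    """Raise an alert when the tail average falls below the (assumed) floor."""
    return historical_cvar(returns, tail) < floor


# Hypothetical daily returns, not data from the paper.
daily = [0.01, -0.02, 0.003, -0.08, 0.015, -0.01, 0.02, -0.05, 0.004, 0.007]
print(round(historical_cvar(daily, tail=0.2), 4))  # -0.065
print(cvar_alert(daily, tail=0.2))                 # True
```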

PIANO: a concurrent, bottlenecked agent brain that scales to 10–1000+ agents and yields specialization, laws, and cultural spread in sandbox

0.20

0.70

0.60

10

PIANO shows how modular, concurrent agent brains plus a small coordination bottleneck produce coherent multi-stream behavior at scale. This matters for products that require many autonomous agents to self-organize, coordinate, or influence user communities, e.g., simulation platforms, game NPCs, synthetic user testing, …

Key finding

Single-agent item progression: agents with full PIANO acquired on average 17 unique Minecraft items after 30 minutes.

Numbers: avg 17 unique items / agent @ 30 min (Figure 5A)

Use natural-language instructions + LLM priors to steer multi‑agent RL toward human-friendly equilibria

0.60

0.30

10

You can steer multi-agent systems to human-friendly conventions without costly human behavior datasets; showing the agent's instruction to users sharply improves team performance and trust.

Key finding

In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.

Numbers: 10/10 random seeds converged to the instructed policy

A benchmark showing LLMs can coordinate by reading environments but struggle to model partners' beliefs and plan jointly

0.30

0.60

0.40

10

LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), cutting training time; but they are unreliable when partner modeling or multi-step joint planning is required.

Key finding

LLM agents match or exceed RL on environment-driven Overcooked layouts.

Numbers: GPT-4-turbo: 260 (AA layout) vs PBT: 190 (Table 1)

Small groups of LLM agents that debate early beat naïve scaling; round consistency and 3×3 setups save tokens

0.60

0.55

0.65

9

Multi-agent LLM setups can raise reasoning accuracy without just scaling model size; using small teams (3 agents) and debate-first protocols often gives better answers while controlling API token costs.

Key finding

Debate-initial or debate-dominant strategies give higher accuracy on reasoning benchmarks.

Numbers: MMLU: p0p0p1 = 65.2 vs p1p0p0 = 34.4 (S4 example)

Survey and roadmap for LLM-based multi-agent systems applied to software engineering

0.40

0.60

0.65

8

Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time; but scale and correctness limits mean human oversight is still required for complex or safety-critical work.

Key finding

The survey covers 71 recent primary studies on LLM-based multi-agent (LMA) systems in software engineering.

Numbers: 71 primary studies (41 identified then +30 via snowballing)

Survey of LLM-based medical agents: architectures, applications, and safety gaps

0.40

0.60

8

LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.

Key finding

The survey reviews 60 studies, narrowed down from roughly 300 initial search hits.

Numbers: 60 studies reviewed (from ~300 initial hits, 80 shortlisted).

Let an LLM program better agents in code: Meta Agent Search discovers agent workflows that beat hand‑designed agents on several benchmarks

0.40

0.80

0.60

7

Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.

Key finding

Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.

Numbers: DROP F1 +13.6 pp (paper claim)

Domain-specific AI agents collaborate to find cross-domain knowledge

0.30

0.50

0.40

7

Orchestrated domain-specific agents can raise answer accuracy for cross-field queries, trading speed for higher-quality, context-aware results.

Key finding

Agents were seeded with domain literature to create domain-specific expertise.

Numbers: ≈1000 papers per agent (Section 2.1)

OpenHands: an open, sandboxed platform that lets LLM-based agents write, run, and browse code like software developers

0.65

0.60

0.70

7

OpenHands reduces the engineering work needed to run and compare LLM-driven developer agents by providing a sandboxed runtime, shared skills, and a benchmark harness under an MIT license, so teams can prototype agent integrations faster and more safely.

Key finding

A single generalist agent (CodeAct) performs competitively across software, web, and miscellaneous tasks without benchmark-specific prompt tuning.

Numbers: HumanEvalFix: 79.3% (CodeAct v1.5, gpt-4o); SWE-Bench Lite: 22–26% (CodeAct v1.8)

A compact map of context-aware multi-agent systems and the five capabilities agents need to work reliably in dynamic settings

0.30

0.40

0.30

6

Context-aware multi-agent design increases robustness and scalability for distributed automation, but requires upfront choices on organization, communication and privacy to avoid noisy or insecure data sharing.

Key finding

CA-MAS design revolves around five agent phases: Sense, Learn, Reason, Predict, Act.

Numbers: 5 phases named explicitly in Section 4.2

VillagerAgent: use DAGs to coordinate LLM agents and a new VillagerBench in Minecraft

0.40

0.60

0.55

6

Modeling task dependencies explicitly with a DAG reduces coordination errors and token cost in LLM-driven multi-agent workflows, making automated team coordination cheaper and more reliable in simulated task domains.

Key finding

VillagerAgent cuts hallucination-driven failures vs AgentVerse on cooking tasks.

Numbers: Failure rate 18.2% (VillagerAgent) vs 44.4% (AgentVerse)
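
The DAG idea can be sketched with Python's standard `graphlib`: a task is handed to an agent only once everything it depends on has finished. This is an illustrative scheduler, not VillagerAgent's implementation; the task names, agent names, and the round-based dispatch are assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical cooking sub-tasks, each mapped to the tasks it depends on.
deps = {
    "gather_wheat": set(),
    "gather_wood": set(),
    "craft_table": {"gather_wood"},
    "bake_bread": {"gather_wheat", "craft_table"},
}


def schedule(deps: dict[str, set[str]], agents: list[str]) -> list[list[tuple[str, str]]]:
    """Return rounds of (agent, task) pairs that respect the dependency DAG."""
    if not agents:
        raise ValueError("need at least one agent")
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG
    pending: list[str] = []
    rounds = []
    while sorter.is_active():
        pending.extend(sorted(sorter.get_ready()))  # newly unblocked tasks
        batch, pending = pending[: len(agents)], pending[len(agents):]
        rounds.append(list(zip(agents, batch)))
        sorter.done(*batch)  # marking tasks done unblocks their dependents
    return rounds


for i, assignments in enumerate(schedule(deps, ["alice", "bob"]), start=1):
    print(i, assignments)
# 1 [('alice', 'gather_wheat'), ('bob', 'gather_wood')]
# 2 [('alice', 'craft_table')]
# 3 [('alice', 'bake_bread')]
```

Because no agent can start `bake_bread` before its inputs exist, a whole class of ordering mistakes is ruled out by construction rather than left to the LLM.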

MACNET: use directed acyclic graphs to scale LLM agents and show a logistic ‘collaborative scaling law’

0.60

0.80

0.60

6

You can improve quality on mixed tasks by running many cooperating LLM agents in a DAG while avoiding expensive retraining; randomized wiring often gives a good speed-quality trade-off.

Key finding

MACNET variants outperform multi-agent and single-agent baselines on average across diverse tasks.

Numbers: Quality: MACNET-RANDOM 0.6522 vs AGENTVERSE 0.5805 (Table 1).

Use attention-equipped diffusion models to learn coordinated multi-agent policies and predict joint trajectories from offline logs

0.60

0.70

0.50

5

MADiff can learn coordinated policies and reliable joint trajectory predictions from logs, enabling product features where online trials are costly or unsafe; it's best for small teams and stable environments.

Key finding

MADiff greatly improves multi-agent trajectory prediction on the NBA dataset.

Numbers: ADE 7.92 ± 0.86 vs 15.15 ± 0.38 (Baller2Vec++), traj len 20
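
ADE (average displacement error) is the mean Euclidean distance between predicted and true positions, so lower is better. A minimal sketch of the computation follows; the data layout and the toy trajectories are assumptions, not the paper's evaluation code.

```python
import math

Point = tuple[float, float]


def average_displacement_error(pred: list[list[Point]], truth: list[list[Point]]) -> float:
    """Mean Euclidean distance over all agents and timesteps.

    pred and truth are indexed as [agent][timestep] -> (x, y).
    """
    if len(pred) != len(truth):
        raise ValueError("agent counts differ")
    total, count = 0.0, 0
    for p_traj, t_traj in zip(pred, truth):
        if len(p_traj) != len(t_traj):
            raise ValueError("trajectory lengths differ")
        for p, t in zip(p_traj, t_traj):
            total += math.dist(p, t)
            count += 1
    if count == 0:
        raise ValueError("no points to compare")
    return total / count


# Toy example: two agents, two timesteps each.
truth = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (0.0, 2.0)]]
pred = [[(0.0, 0.0), (1.0, 3.0)], [(4.0, 1.0), (0.0, 2.0)]]
print(average_displacement_error(pred, truth))  # 1.75
```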

A configurable multi-agent framework that adds persona trees and a skill-backed cognitive architecture to make LLM agents act more human in …

0.40

0.60

0.50

5

CGMI lets product teams simulate social workflows (training, UX, game NPCs, edtech) with more realistic agent behavior by adding persona trees and memory-driven planning.

Key finding

Teacher utterances dominated classroom discourse in simulated lessons.

Numbers: Teacher behavior averaged 61.23% of discourse (across C1–C3).

Autonomously evolve multi‑agent AI systems using iterative LLM feedback (Llama 3.2-3B)

0.60

0.50

0.70

5

Automating agent‑level tuning reduces manual engineering, improves output quality and consistency, and scales agentic solutions across domains.

Key finding

Evolved systems show median evaluation scores near or above 0.9 on key criteria across case studies.

Numbers: median ≥ 0.9 across multiple case studies

WebPilot: MCTS-inspired multi-agent system that decomposes web tasks and uses reflective search to improve web automation

0.60

0.70

0.50

4

WebPilot improves success on realistic, multi-step web automation by decomposing tasks and using reflection-guided search, which reduces rework and increases reliability for complex automation workflows.

Key finding

WebPilot (GPT-4o) achieves 37.2% average success rate on WebArena.

Numbers: 37.2% SR (WebArena, GPT-4o, WebPilot)
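
For context on "MCTS-inspired" search, the sketch below shows the standard UCT selection rule from Monte Carlo tree search, which balances exploiting high-reward actions against exploring rarely tried ones. It is not WebPilot's actual scoring; the action names, statistics, and exploration constant are hypothetical.

```python
import math


def uct_score(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT value: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited actions first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select(children: dict[str, tuple[float, int]], parent_visits: int) -> str:
    """Pick the child action with the highest UCT score."""
    return max(children, key=lambda name: uct_score(*children[name], parent_visits))


# Hypothetical web actions: name -> (total reward, visit count).
children = {
    "open_settings": (3.0, 5),
    "search_issue": (1.0, 1),
    "click_profile": (0.0, 0),
}
print(select(children, parent_visits=6))  # click_profile
```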

Learn a sparse communication graph for multi-agent teams; matches full communication while using 40% of edges

0.60

4

Learned sparse communication can cut bandwidth and messaging hardware needs while keeping team performance, so multi-robot warehouses or distributed fleets can save cost and latency without retraining for every topology.

Key finding

CommFormer often matches fully-connected communication while using 40% of edges.

Numbers: S = 0.4 (40% of edges); on many SMAC maps it reaches a 100.0% win rate, on par with fully connected (FC) communication
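
To illustrate what "using 40% of edges" means, the sketch below keeps only the highest-scoring fraction of directed communication links. This is a simple top-k stand-in, not CommFormer's method for learning the graph; the score matrix and function name are assumptions.

```python
def sparsify(scores: list[list[float]], sparsity: float = 0.4) -> list[list[int]]:
    """Keep the top `sparsity` fraction of directed edges by score.

    scores[i][j] is an assumed usefulness of agent j sending messages to agent i.
    Returns a 0/1 adjacency matrix with no self-loops.
    """
    n = len(scores)
    edges = [(scores[i][j], i, j) for i in range(n) for j in range(n) if i != j]
    keep = round(sparsity * len(edges))
    adj = [[0] * n for _ in range(n)]
    for _, i, j in sorted(edges, reverse=True)[:keep]:
        adj[i][j] = 1
    return adj


# Hypothetical scores for 5 agents (20 possible directed edges).
n_agents = 5
scores = [[((i * 7 + j * 3) % 11) / 10 for j in range(n_agents)] for i in range(n_agents)]
adjacency = sparsify(scores, sparsity=0.4)
print(sum(map(sum, adjacency)))  # 8
```

With 5 agents there are 20 possible directed links, so a sparsity of 0.4 leaves 8 of them active.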

A modular blueprint for running reliable multi-agent workflows with planning, tool refinement, and episodic memory

0.60

0.50

0.40

4

Gives a reusable engineering blueprint to run reliable, auditable multi-agent automation across existing enterprise systems without retraining models.

Key finding

Narrow, persona-like agents perform more reliably than broad agents.

How large language models (LLMs) are being used to coordinate, plan, and control teams of robots

0.30

0.60

0.50

4

LLMs can speed up multi-robot coordination and simplify human instructions, but current limitations (math errors, hallucinations, latency) mean companies should pilot hybrid systems that pair LLMs for planning with verified controllers for execution.

Key finding

LLMs are being used at four operational levels in multi-robot systems (MRS): task allocation, motion planning, action generation, and human-in-the-loop.

Make LLMs argue: multi-model round-table + confidence-weighted voting improves reasoning

0.60

0.70

0.40

4

Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.

Key finding

RECONCILE boosts team accuracy on Date Understanding by a large margin versus a leading multi-agent debate baseline.

Numbers: 75.3 → 86.7 (+11.4pp)
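
Confidence-weighted voting can be sketched as follows: each model returns an answer with a confidence, and the answer with the largest summed confidence wins. Using raw confidences directly as weights is a simplifying assumption here, and the answers and confidence values are made up for illustration.

```python
from collections import defaultdict


def weighted_vote(responses: list[tuple[str, float]]) -> str:
    """Return the answer whose supporters have the highest total confidence."""
    if not responses:
        raise ValueError("need at least one response")
    totals: dict[str, float] = defaultdict(float)
    for answer, confidence in responses:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Hypothetical final-round answers from three different models.
round_answers = [("05/31/2021", 0.95), ("06/01/2021", 0.4), ("06/01/2021", 0.4)]
print(weighted_vote(round_answers))  # 05/31/2021
```

In this example a plain majority vote would pick 06/01/2021, while the weighted vote favors the single highly confident answer, which is the behavior that distinguishes the two schemes.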

DrugAgent: a multi-agent LLM system that combines ML, knowledge graphs, and web search to predict and explain drug-target interactions

0.45

0.60

0.35

3

Combining ML, knowledge graphs, and literature with explicit reasoning yields fewer false positives and clearer explanations, which reduces wasted lab validation and speeds decision-making in drug discovery.

Key finding

DrugAgent improves balanced DTI prediction vs a non-reasoning LLM baseline.

Numbers: F1 0.514 vs 0.355 (≈+45% relative) on evaluated kinase–compound subsets
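
The "≈+45% relative" figure follows directly from the two F1 scores; a short check of the arithmetic (the helper name is illustrative):

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    if old == 0:
        raise ValueError("old must be non-zero")
    return (new - old) / old


gain = relative_gain(0.514, 0.355)
print(f"absolute: {0.514 - 0.355:+.3f}, relative: {gain:+.1%}")
# absolute: +0.159, relative: +44.8%
```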