156 papers found

ChatDev: multi-agent LLMs that chat to design, code, and test software

0.40
0.50
0.30
69

ChatDev makes prototyping software faster and more reliable by combining role-based LLM agents into a chained workflow that raises the chance code runs without heavy manual fixes.

Key finding

ChatDev generates more runnable software than baselines.

Numbers: Executability: ChatDev 0.88 vs GPT-Engineer 0.3583, MetaGPT 0.4145

A modular graph framework that lets multiple LLM agents collaborate, create agents, and supervise each other

0.30
0.60
0.40
50

Modular LLM agents let teams split complex workflows, add verifiers to reduce costly errors, and plug in APIs safely — but they add orchestration costs and governance requirements.

Key finding

Agents can be modeled as tuples (L, R, S, C, H) to standardize behavior and permissions.
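
As a rough illustration of the tuple view, the sketch below stores the five components in a small Python record. The listing does not expand the letters, so the field meanings used here (language model, role, state, creation permission, halt permission) are assumptions, and `Agent` and `can_halt` are hypothetical names rather than the framework's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Agent:
    """Hypothetical container for the (L, R, S, C, H) tuple."""
    L: str                                            # assumed: language model identifier
    R: str                                            # assumed: role description
    S: Dict[str, Any] = field(default_factory=dict)   # assumed: mutable agent state
    C: bool = False                                   # assumed: may create new agents
    H: Set[str] = field(default_factory=set)          # assumed: roles this agent may halt


def can_halt(supervisor: Agent, target_role: str) -> bool:
    """Permission check: is the target role in the supervisor's halt set?"""
    return target_role in supervisor.H


verifier = Agent(L="some-llm", R="verifier", H={"coder"})
coder = Agent(L="some-llm", R="coder")
print(can_halt(verifier, "coder"), can_halt(coder, "verifier"))
```

Keeping permissions (C, H) as explicit fields is what makes them checkable before an agent acts, which is the standardization the key finding points to.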

PIANO: a concurrent, bottlenecked agent brain that scales to 10–1000+ agents and yields specialization, laws, and cultural spread in a sandbox

0.20
0.70
0.60
10

PIANO shows how modular, concurrent agent brains plus a small coordination bottleneck produce coherent multi-stream behavior at scale. This matters for products that require many autonomous agents to self-organize, coordinate, or influence user communities, e.g., simulation platforms, game NPCs, and synthetic user testing.

Key finding

Single-agent item progression: agents with full PIANO acquired on average 17 unique Minecraft items after 30 minutes.

Numbers: avg 17 unique items / agent @ 30 min (Figure 5A)

A benchmark showing LLMs can coordinate by reading environments but struggle to model partners' beliefs and plan jointly

0.30
0.60
0.40
10

LLMs can act as zero-shot coordination partners for tasks where the environment dictates the correct action (logistics routing, scripted multi-robot tasks), cutting training time; but they are unreliable when partner modeling or multi-step joint planning is required.

Key finding

LLM agents match or exceed RL on environment-driven Overcooked layouts.

Numbers: GPT-4-turbo: 260 (AA layout) vs PBT: 190 (Table 1)

Small groups of LLM agents that debate early beat naïve scaling; round consistency and 3×3 setups save tokens

0.60
0.55
0.65
9

Multi-agent LLM setups can raise reasoning accuracy without just scaling model size; using small teams (3 agents) and debate-first protocols often gives better answers while controlling API token costs.

Key finding

Debate-initial or debate-dominant strategies give higher accuracy on reasoning benchmarks.

Numbers: MMLU: p0p0p1 = 65.2 vs p1p0p0 = 34.4 (S4 example)

Survey and roadmap for LLM-based multi-agent systems applied to software engineering

0.40
0.60
0.65
8

Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time; but scale and correctness limits mean human oversight is still required for complex or safety-critical work.

Key finding

Surveyed 71 recent primary studies on LMA in software engineering.

Numbers: 71 primary studies (41 identified then +30 via snowballing)

A compact map of context-aware multi-agent systems and the five capabilities agents need to work reliably in dynamic settings

0.30
0.40
0.30
6

Context-aware multi-agent design increases robustness and scalability for distributed automation, but requires upfront choices on organization, communication and privacy to avoid noisy or insecure data sharing.

Key finding

CA-MAS design revolves around five agent phases: Sense, Learn, Reason, Predict, Act.

Numbers: 5 phases named explicitly in Section 4.2

Use attention-equipped diffusion models to learn coordinated multi-agent policies and predict joint trajectories from offline logs

0.60
0.70
0.50
5

MADiff can learn coordinated policies and reliable joint trajectory predictions from logs, enabling product features where online trials are costly or unsafe; it's best for small teams and stable environments.

Key finding

MADiff greatly improves multi-agent trajectory prediction on the NBA dataset.

Numbers: ADE 7.92 ± 0.86 vs 15.15 ± 0.38 (Baller2Vec++), traj len 20
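
For readers unfamiliar with the metric, ADE (average displacement error) is commonly computed as the mean Euclidean distance between predicted and ground-truth positions. The sketch below assumes 2D points flattened over agents and timesteps; the paper's exact averaging and units are not given in this listing, and the trajectory values are made up.

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between paired (x, y) points."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    total = sum(
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    )
    return total / len(pred)


# Illustrative data: a 20-step trajectory, every prediction offset by (3, 4).
truth = [(float(t), 0.0) for t in range(20)]
pred = [(x + 3.0, y + 4.0) for x, y in truth]
ade = average_displacement_error(pred, truth)
print(ade)  # each point is off by exactly 5 units, so the mean is 5.0
```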

A configurable multi-agent framework that adds persona trees and a skill-backed cognitive architecture to make LLM agents act more human

0.40
0.60
0.50
5

CGMI lets product teams simulate social workflows (training, UX, game NPCs, edtech) with more realistic agent behavior by adding persona trees and memory-driven planning.

Key finding

Teacher utterances dominated classroom discourse in simulated lessons.

Numbers: Teacher behavior averaged 61.23% of discourse (across C1–C3).

Learn a sparse communication graph for multi-agent teams; matches full communication while using 40% of edges

0.60
0.60
0.60
4

Learned sparse communication can cut bandwidth and messaging hardware needs while keeping team performance, so multi-robot warehouses or distributed fleets can save cost and latency without retraining for every topology.

Key finding

CommFormer often matches fully-connected communication while using 40% of edges.

Numbers: S=0.4 (40% edges); many SMAC maps show 100.0% win rate vs FC
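
To make the 40% figure concrete, the sketch below keeps the top-scoring fraction of directed edges in a communication graph. This is only a fixed top-k selection over given scores, not CommFormer's learned graph; the function name, the score matrix, and the choice to count only off-diagonal edges are assumptions.

```python
def keep_top_edges(scores, sparsity):
    """Return a 0/1 adjacency matrix keeping the highest-scoring directed edges."""
    n = len(scores)
    # Candidate edges exclude self-loops (an assumption of this sketch).
    edges = [(scores[i][j], i, j) for i in range(n) for j in range(n) if i != j]
    k = round(sparsity * len(edges))
    adjacency = [[0] * n for _ in range(n)]
    for _, i, j in sorted(edges, reverse=True)[:k]:
        adjacency[i][j] = 1
    return adjacency


# Illustrative scores for 6 agents (30 possible directed edges).
n_agents = 6
scores = [[((i * 7 + j * 3) % 11) / 10 for j in range(n_agents)] for i in range(n_agents)]
graph = keep_top_edges(scores, sparsity=0.4)
print(sum(map(sum, graph)))  # 12 of 30 edges are kept
```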

DrugAgent: a multi-agent LLM system that combines ML, knowledge graphs, and web search to predict and explain drug-target interactions

0.45
0.60
0.35
3

Combining ML, knowledge graphs, and literature with explicit reasoning yields fewer false positives and clearer explanations, which reduces wasted lab validation and speeds decision-making in drug discovery.

Key finding

DrugAgent improves balanced DTI prediction vs a non-reasoning LLM baseline.

Numbers: F1 0.514 vs 0.355 (≈+45% relative) on evaluated kinase–compound subsets
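
The "≈+45% relative" figure follows from the two F1 scores above; a quick check of the arithmetic (the helper name is just for illustration):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


gain = relative_gain(0.514, 0.355)
print(f"{gain:.1%}")  # about 44.8%, i.e. roughly +45%
```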

A Petri-net based framework (communication spaces) to design and switch between multi-agent and tightly integrated human-AI systems

0.40
0.60
0.50
3

The framework gives product and engineering teams a concrete way to choose and design either modular multi-agent systems or tightly integrated human-AI workflows, reducing integration errors and clarifying where human oversight must remain.

Key finding

Communication spaces split interaction into surface, observation, and computation layers.

Practical survey and roadmap for four agent interoperability protocols (MCP, ACP, A2A, ANP)

0.70
0.40
0.70
3

Standardizing agent interfaces reduces engineering cost, improves security, and enables reusable agent services across teams and vendors.

Key finding

Four distinct protocols target different interoperability layers

Numbers: 4 protocols compared (MCP, ACP, A2A, ANP)

DebUnc: use uncertainty estimates to steer multi-agent debates by scaling attention to confident agents

0.40
0.60
0.50
3

If you run LLM agent systems, prioritizing more confident agents reduces the chance of the group converging on confidently wrong answers; attention-scaling gives the biggest payoff when you can measure confidence reliably.

Key finding

Attention scaling (Attention-All) improves final debate accuracy when uncertainty estimates are good.

Numbers: Mistral: avg accuracy 0.67 (Attention-All, Oracle) vs 0.53 (standard) → +0.14

An Internet-like platform that links diverse LLM agents into dynamic teams and chat groups

0.60
0.55
0.40
3

IoA lets you combine existing specialized agents into coordinated teams to raise task success without re-training models; expect better QA and tool use at the cost of coordination tokens and some extra infra.

Key finding

IoA substantially improves win rates on open-ended instruction tasks when it orchestrates third-party agents.

Numbers: Win rate vs AutoGPT: 76.5%; vs Open Interpreter: 63.4%

A single scalar predicts when adding agents helps, stalls, or destroys performance under a fixed compute budget

0.60
0.70
0.60
2

Running many agents isn't always better: under fixed budget and token-limited contexts, coordination cost and shared blind spots can make scale-out hurt. Measure your system's message fidelity and error correlation to decide whether to add agents or invest in longer messages and diversity.

Key finding

Deep hierarchical aggregation exhibits a sharp phase transition: amplification vs collapse is decided by a single scalar α_ρ.

Numbers: α_ρ > 1 => amplification; α_ρ ≤ 1 => collapse
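
The threshold rule above can be written as a one-line check. How α_ρ is estimated from message fidelity and error correlation is not given in this listing, so the sketch takes it as an already-computed input; the function name is hypothetical.

```python
def aggregation_regime(alpha_rho: float) -> str:
    """Classify the regime from the stated threshold: > 1 amplifies, <= 1 collapses."""
    return "amplification" if alpha_rho > 1 else "collapse"


for value in (0.8, 1.0, 1.3):
    print(value, aggregation_regime(value))
```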

Agentic AI breaks the old rules of human-AI teams — shared awareness helps, but continuous governance is required

0.30
0.70
0.60
2

Agentic AI can change behavior and priorities after deployment; firms must monitor intermediate commitments, add decision checkpoints, and align incentives so automation doesn't drift from strategic goals.

Key finding

Agentic AI creates three structural uncertainties—action trajectories, generative outputs, and evolving objectives—that differ qualitatively from task-bound systems.

Decentralized LLM-powered agents to assist and gradually control accelerator subsystems

1.00
0.80
0.60
2

A modular agent layer can reduce operator time on diagnostics, speed script generation for domain-specific languages, and let facilities pilot automation safely while keeping legacy safety systems intact.

Key finding

A decentralized, agent-based architecture can map cleanly onto accelerator subsystems and operator workflows.

Use a learned manager to steer LLM agents by changing who sees what — raising network cooperation without rewiring links

0.40
0.60
0.60
2

Adaptive control of who sees what is a low-cost governance lever: you can raise coordination among autonomous agents without changing incentives or network wiring, cutting engineering and policy friction.

Key finding

A learned RL manager drives full network cooperation in the simulated PD runs.

Numbers: Reach 100% mutual cooperation (CC) by timestep 10 on average (RL method)

Corex: make multiple LLM agents Discuss, Review and Retrieve to improve complex reasoning

0.60
0.70
0.70
2

Corex can boost accuracy on complex reasoning tasks while cutting inference token costs substantially; that reduces API bills and enables mixing cheaper open-source models with stronger ones for cost-effective pipelines.

Key finding

Retrieve mode with 5 agents improves average math accuracy over strong self-consistency baseline.

Numbers: Math avg: Corex-Retrieve 86.3 vs CoT-SC(10) 84.6 (+1.7 pp)
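
For context on the baseline, CoT-SC(10) refers to self-consistency: sample several chain-of-thought answers and return the most frequent one. A minimal sketch of the voting step, with made-up sampled answers standing in for model outputs:

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent final answer among the sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Illustrative: 10 sampled final answers to one math question.
samples = ["42", "42", "41", "42", "40", "42", "42", "41", "42", "42"]
voted = majority_answer(samples)
print(voted)  # "42" appears 7 times out of 10
```

Ties are resolved here by first occurrence, which is a choice of this sketch rather than something the listing specifies.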

A drag-and-drop, no-code UI + APIs for building, testing, profiling, and exporting multi-agent workflows

0.30
0.60
0.60
2

AutoGen Studio shortens the gap between idea and working multi-agent prototype. Teams can visually assemble agents, track costs and tool failures, and export workflows to run as APIs or Docker containers. This accelerates experimentation and handoff to engineers while keeping reproducible component specs.

Key finding

Wide early adoption and active feedback loop

Numbers: 200K+ installs in 5 months; >135 GitHub issues

CoThinker: use Cognitive Load Theory to make LLM teams solve high‑load tasks

0.60
0.60
0.60
1

Designing LLM teams with shared memory and structured communication reduces reasoning failures on complex problems, improving solution quality for data analysis and math tasks while requiring careful tuning to avoid extra coordination cost.

Key finding

Attention entropy rises with task complexity, consistent with higher working‑memory demands.

Numbers: Attention entropy: Level1=4.44 → Level3=5.04 → Level4=6.10
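
Attention entropy is typically the Shannon entropy of an attention distribution: flat attention gives high entropy, sharply focused attention gives low entropy. The sketch below uses natural logarithms and made-up weights; the paper's log base and how it aggregates over heads and layers are not stated in this listing.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


uniform = attention_entropy([1.0] * 8)                 # spread evenly over 8 tokens
peaked = attention_entropy([0.93] + [0.01] * 7)        # mostly on a single token
print(round(uniform, 4), round(peaked, 4))
```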

How uncertainty can make multi-agent systems ask humans for supervision

0.35
0.60
0.30
1

Designing agents with calibrated uncertainty can force them to request human oversight, lowering risk of harmful autonomous actions while trading off autonomy and throughput.

Key finding

A defending agent is incentivized to ask the human when two derived inequalities (Theorem 1) hold.

WELLA: fine-tuned LLM agents that generate dynamic workload estimates for multi‑operator nuclear control rooms

0.40
0.60
0.50
1

Automating realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.

Key finding

WELLA predicts per-role workload with very high fit for RO3.

Numbers: RO3 R²=0.9628, RMSE=3.5327, MAE=1.92
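
The three fit metrics reported above have standard definitions, sketched here in plain Python. The workload values are made up for illustration and are not WELLA data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Illustrative workload scores (not from the paper).
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
print(metrics)
```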