Survey and roadmap for LLM-based multi-agent systems applied to software engineering

Overview

Decision Snapshot: Needs Validation

The paper compiles extensive references and two small case studies; it provides a practical roadmap but lacks large-scale empirical evaluation and public artifacts for replication.

Citations: 8

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 40%

Novelty: 60%

Authors

Junda He, Christoph Treude, David Lo

Why It Matters For Business

Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time. Limits in scale and correctness mean human oversight is still required for complex or safety-critical work.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper surveys 71 recent papers on LLM-based multi-agent (LMA) systems in software engineering, runs two case studies with ChatDev (Snake and Tetris), and proposes a two-phase research agenda: (1) improve single-agent role skills and prompting, and (2) optimize agent collaboration, scaling, privacy, and evaluation. The case studies show LMAs are fast and cheap for moderate tasks (Snake: 76 s and $0.019 per run on average) but struggle with deeper logic (Tetris: success only on the 10th run, with row removal missing). The paper calls for new benchmarks, role-specific fine-tuning, agent-oriented prompting languages, and privacy controls.

Problem Statement

Single LLMs are limited for complex, multi-domain software engineering tasks. The field lacks a systematic map of LLM multi-agent work, realistic benchmarks that test collaboration, and engineering recipes for role specialization, scaling, privacy, and human-agent division of labor.

Main Contribution

Systematic review of 71 primary studies on LLM-based multi-agent systems for software engineering.

Two hands-on case studies using ChatDev (GPT-3.5-turbo) to evaluate practical strengths and limits.

Key Findings

Surveyed 71 recent primary studies on LMA in software engineering.

Numbers: 71 primary studies (41 identified, then 30 more via snowballing)

Practical Use: There is a substantial and growing body of work. Audit these 71 systems before building a new LMA to avoid duplication.

Evidence Ref: Introduction; Section 3 search summary

ChatDev produced a playable Snake game quickly and cheaply.

Numbers: Snake: avg 76 seconds, $0.019 per run

Practical Use: Use LMA tools for rapid prototyping of moderate software tasks to save developer time and cost.

Evidence Ref: Section 4.1 Case Study: Snake

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Snake game completion time	76 seconds (avg)	—	—	ChatDev case study	Authors ran ChatDev; first attempt failed, second produced playable game; reported avg time	Section 4.1
Snake game cost	$0.019 per run (avg)	—	—	ChatDev case study	Reported API cost per attempt	Section 4.1

What To Try In 7 Days

Run an LMA prototype (e.g., ChatDev) on a small, well-specified feature to measure time and cost.

Define 2–3 agent roles (planner, coder, tester) and run iterative cycles to see failure modes.

Audit available project data for privacy-sensitive content and prototype a simple access control before agent onboarding.
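The role-based trial above can be rehearsed without any LLM calls first. The following is a minimal sketch, not ChatDev's implementation: the `Agent` type, `run_cycles`, `RunLog`, and the three stub agents are hypothetical names introduced here for illustration. Each stub would be replaced by a real model call in an actual trial.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM-backed agent: takes a prompt, returns text.
Agent = Callable[[str], str]


@dataclass
class RunLog:
    iterations: int = 0
    seconds: float = 0.0
    failures: List[str] = field(default_factory=list)
    passed: bool = False


def run_cycles(task: str, planner: Agent, coder: Agent, tester: Agent,
               max_iters: int = 3) -> RunLog:
    """Plan once, then loop coder -> tester until the tester passes or we give up."""
    log = RunLog()
    start = time.perf_counter()
    plan = planner(task)
    feedback = ""
    for _ in range(max_iters):
        log.iterations += 1
        code = coder(f"{plan}\n{feedback}")
        verdict = tester(code)
        if verdict == "PASS":
            log.passed = True
            break
        # Record every failure message so failure modes can be reviewed later.
        log.failures.append(verdict)
        feedback = f"Previous attempt failed: {verdict}"
    log.seconds = time.perf_counter() - start
    return log


# Stub agents (assumptions for the demo, not real model behaviour).
def stub_planner(task: str) -> str:
    return f"Plan for: {task}"


def stub_coder(prompt: str) -> str:
    return "v2" if "failed" in prompt else "v1"


def stub_tester(code: str) -> str:
    return "PASS" if code == "v2" else "missing row removal"


log = run_cycles("small feature", stub_planner, stub_coder, stub_tester)
print(log.iterations, log.passed, log.failures)
```

Keeping the failure messages in the log, instead of only a pass/fail flag, is what makes the failure modes visible after a few cycles.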

Agent Features

Memory

Short-term memory (current session), Long-term memory (experience/history), Retrieval memory (repo or docs)

Planning

Centralized Planning, Decentralized Planning

Tool Use

Compiler and test tool integration, Static analyzers, Retrieval agents for repo search

Frameworks

ChatDev, MetaGPT, AutoGen, LangChain

Is Agentic

Yes

Architectures

Orchestration platform + agents, Hierarchical agent teams, Centralized planning / Decentralized execution

Collaboration

Agent Communication, Multi-agent Coordination, Debate and cross-validation

Optimization Features

Token Efficiency

Prompt design and summarization to reduce context size
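One way to picture summarization for context reduction is a history compressor that keeps the newest messages verbatim and folds older ones into a single summary line. This is a minimal sketch under stated assumptions: `count_tokens` approximates tokens by whitespace-separated words, and `stub_summarize` is a hypothetical placeholder for an LLM summarization call. Neither comes from the paper.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def compress_history(messages: List[str], budget: int,
                     summarize: Callable[[List[str]], str]) -> List[str]:
    """Keep the newest messages that fit in `budget`; fold the rest into one summary.

    The summary line itself is not counted against the budget in this sketch.
    """
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    older = messages[: len(messages) - len(kept)]
    if older:
        return [summarize(older)] + kept
    return kept


def stub_summarize(msgs: List[str]) -> str:
    # Hypothetical placeholder for a model-generated summary.
    return f"[summary of {len(msgs)} earlier messages]"


history = [
    "planner: build a snake game with score display",
    "coder: wrote main loop and input handling",
    "tester: collision check fails at wall",
    "coder: fixed wall collision",
]
context = compress_history(history, budget=12, summarize=stub_summarize)
print(context)
```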

Infra Optimization

Message prioritization to reduce communication overhead
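Message prioritization can be illustrated with a small priority queue that delivers only the most urgent messages each round and defers the rest. The sketch below is an assumption about how such a scheme might look, not a mechanism described in the paper. `MessageQueue`, `post`, and `deliver_round` are hypothetical names, and lower numbers mean higher urgency.

```python
import heapq
import itertools
from typing import List, Tuple


class MessageQueue:
    """Deliver at most `capacity_per_round` messages per round, most urgent first."""

    def __init__(self, capacity_per_round: int) -> None:
        self.capacity = capacity_per_round
        self._heap: List[Tuple[int, int, str]] = []
        # The counter breaks ties so equal-priority messages keep posting order.
        self._counter = itertools.count()

    def post(self, priority: int, text: str) -> None:
        # Lower priority number = more urgent.
        heapq.heappush(self._heap, (priority, next(self._counter), text))

    def deliver_round(self) -> List[str]:
        delivered: List[str] = []
        while self._heap and len(delivered) < self.capacity:
            delivered.append(heapq.heappop(self._heap)[2])
        return delivered


queue = MessageQueue(capacity_per_round=2)
queue.post(2, "style nit in README")
queue.post(0, "build is broken")
queue.post(1, "unit test failed")
first_round = queue.deliver_round()
second_round = queue.deliver_round()
print(first_round, second_round)
```

Capping deliveries per round is the part that reduces communication overhead. Low-priority messages wait instead of consuming every agent's context.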

System Optimization

Scale horizontally by adding agents, Central knowledge repository to avoid inconsistencies

Training Optimization

Parameter-efficient fine-tuning (PEFT) suggested

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Empirical validation is limited to two small case studies with ChatDev.

No public release of experimental code or reproducible datasets.

When Not To Use

For safety-critical systems without strict privacy and correctness guarantees.

As a fully autonomous replacement on complex projects requiring deep abstraction.

Failure Modes

Hallucination or incorrect logic that passes casual review

Consensus on wrong solution due to correlated agent errors
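The consensus failure mode can be shown with a toy example: when two of three agents share the same bug, a majority vote selects the buggy answer, and only an independent check (here, executing a test) catches it. This is an illustrative sketch, not a method from the paper. The candidate functions, `majority_answer`, `accept`, and the simplified row-removal check are all assumptions made for the demo.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical candidate implementations proposed by agents.
candidates: Dict[str, Callable[[Grid], Grid]] = {
    # Buggy: never removes completed rows.
    "keep_all": lambda grid: grid,
    # Simplified fix: drops rows where every cell is filled.
    "drop_full": lambda grid: [row for row in grid if not all(row)],
}


def majority_answer(votes: List[str]) -> Tuple[str, float]:
    """Return the most common vote and its share of all votes."""
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)


def independent_check(name: str) -> bool:
    # Executable test that does not depend on any agent's opinion.
    return candidates[name]([[1, 1], [1, 0]]) == [[1, 0]]


def accept(votes: List[str]) -> Optional[str]:
    """Accept the majority answer only if it also passes the independent check."""
    winner, _share = majority_answer(votes)
    return winner if independent_check(winner) else None


votes = ["keep_all", "keep_all", "drop_full"]  # two agents share the same bug
winner, share = majority_answer(votes)
result = accept(votes)
print(winner, round(share, 2), result)
```

Here the vote alone would ship `keep_all`. Gating on an executed test rejects it, which is why agreement among agents should not be treated as evidence of correctness.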

Core Entities

Models

GPT-3.5-turbo, ChatGPT, LLaMA, Claude, Gemini

Metrics

time_per_run, cost_per_run, iteration_count
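These three metrics are straightforward to record per run and average across runs. The sketch below assumes a simple record type. `RunRecord`, `summarize_runs`, the `success` field, and the sample values are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunRecord:
    time_per_run: float     # wall-clock seconds
    cost_per_run: float     # API cost in USD
    iteration_count: int    # attempts needed in this run
    success: bool           # whether the run produced a working result


def summarize_runs(runs: List[RunRecord]) -> Dict[str, float]:
    """Average each metric over all runs and report the success rate."""
    return {
        "avg_time_s": mean([r.time_per_run for r in runs]),
        "avg_cost_usd": mean([r.cost_per_run for r in runs]),
        "avg_iterations": mean([r.iteration_count for r in runs]),
        "success_rate": sum(1 for r in runs if r.success) / len(runs),
    }


# Placeholder values for illustration only.
runs = [
    RunRecord(time_per_run=60.0, cost_per_run=0.02, iteration_count=1, success=False),
    RunRecord(time_per_run=80.0, cost_per_run=0.02, iteration_count=2, success=True),
]
summary = summarize_runs(runs)
print(summary)
```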

Context Entities

Models

GPT-4 (referenced), Codex (context)

Metrics

consensus_score, communication_efficiency

Datasets

historical project logs (referenced as source for experiential co-learning)

Benchmarks

LiveCodeBench, BigCodeBench
