Overview
The paper compiles extensive references and two small case studies; it provides a practical roadmap but lacks large-scale empirical evaluation and public artifacts for replication.
Citations: 8
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time. However, limits on scale and correctness mean human oversight is still required for complex or safety-critical work.
Who Should Care
Summary TLDR
This paper surveys 71 recent papers on LLM-based multi-agent (LMA) systems in software engineering, runs two case studies with ChatDev (Snake and Tetris), and proposes a two-phase research agenda: (1) improve single-agent role skills and prompting, and (2) optimize agent collaboration, scaling, privacy, and evaluation. The case studies show LMAs are fast and cheap for moderate tasks (Snake: 76 s and $0.019 per run on average) but struggle with deeper logic (Tetris: succeeded only on the 10th run; row removal was missing). The paper calls for new benchmarks, role-specific fine-tuning, agent-oriented prompting languages, and privacy controls.
Problem Statement
Single LLMs are limited for complex, multi-domain software engineering tasks. The field lacks a systematic map of LLM multi-agent work, realistic benchmarks that test collaboration, and engineering recipes for role specialization, scaling, privacy, and human-agent division of labor.
Main Contribution
Systematic review of 71 primary studies on LLM-based multi-agent systems for software engineering.
Two hands-on case studies using ChatDev (GPT-3.5-turbo) to evaluate practical strengths and limits.
Key Findings
Surveyed 71 recent primary studies on LMA in software engineering.
ChatDev produced a playable Snake game quickly and cheaply.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Snake game completion time | 76 seconds (avg) | — | — | ChatDev case study | Authors ran ChatDev; first attempt failed, second produced playable game; reported avg time | Section 4.1 |
| Snake game cost | $0.019 per run (avg) | — | — | ChatDev case study | Reported API cost per attempt | Section 4.1 |
What To Try In 7 Days
Run an LMA prototype (e.g., ChatDev) on a small, well-specified feature to measure time and cost.
Define 2–3 agent roles (planner, coder, tester) and run iterative cycles to see failure modes.
Audit available project data for privacy-sensitive content and prototype a simple access control before agent onboarding.
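The role split and iterative cycles suggested above can be sketched in a few lines. This is a minimal illustration, not ChatDev's API: `run_cycles`, `RunLog`, `fake_llm`, and `fake_check` are hypothetical names, the "LLM" is a deterministic stub, the tester role is reduced to a plain check function, and the per-token price is a placeholder assumption to be replaced with your provider's actual rate.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for an LLM call: prompt in, (text, tokens_used) out.
LLMCall = Callable[[str], Tuple[str, int]]


@dataclass
class RunLog:
    cycles: int = 0
    tokens: int = 0
    seconds: float = 0.0
    passed: bool = False
    failures: List[str] = field(default_factory=list)


def run_cycles(task: str, llm: LLMCall,
               check: Callable[[str], Optional[str]],
               max_cycles: int = 3) -> RunLog:
    """Planner once, then coder/tester cycles until the check passes."""
    log = RunLog()
    start = time.perf_counter()
    plan, used = llm(f"[planner] Break this task into steps: {task}")
    log.tokens += used
    feedback = ""
    for _ in range(max_cycles):
        log.cycles += 1
        code, used = llm(f"[coder] Plan: {plan}\nFeedback: {feedback}\nWrite the code.")
        log.tokens += used
        error = check(code)  # tester role: returns None on success
        if error is None:
            log.passed = True
            break
        log.failures.append(error)  # keep failure modes for later review
        feedback = error
    log.seconds = time.perf_counter() - start
    return log


def estimated_cost(log: RunLog, usd_per_1k_tokens: float) -> float:
    # The price is an assumption supplied by the caller, not a real rate.
    return log.tokens / 1000 * usd_per_1k_tokens


def fake_llm(prompt: str) -> Tuple[str, int]:
    # Deterministic stub: the "coder" only fixes the bug after seeing feedback.
    if prompt.startswith("[planner]"):
        return "1. draw grid 2. move snake", 50
    if "missing collision check" in prompt:
        return "def step(): move(); check_collision()", 120
    return "def step(): move()", 100


def fake_check(code: str) -> Optional[str]:
    return None if "check_collision" in code else "missing collision check"


log = run_cycles("snake game", fake_llm, fake_check)
print(log.passed, log.cycles, log.tokens, log.failures)
```

Swapping `fake_llm` for a real client call and `fake_check` for a test runner would give per-task time, token, and failure-mode records comparable to the figures reported in the case studies.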
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Training Optimization
Reproducibility
Risks & Boundaries
Limitations
Empirical validation is limited to two small case studies with ChatDev.
No public release of experimental code or reproducible datasets.
When Not To Use
For safety-critical systems without strict privacy and correctness guarantees.
As a fully autonomous replacement on complex projects requiring deep abstraction.
Failure Modes
Hallucination or incorrect logic that passes casual review
Consensus on wrong solution due to correlated agent errors
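A short calculation can illustrate why correlated agent errors undermine consensus. The sketch below is an assumption-laden toy model, not an analysis from the paper: each agent is wrong with the same probability `p`, agents vote, and the two function names are hypothetical. It compares a majority vote over independent agents with the extreme case where all agents fail together.

```python
from math import comb


def majority_wrong_independent(n: int, p: float) -> float:
    """Probability that a strict majority of n independent agents is wrong."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def majority_wrong_fully_correlated(p: float) -> float:
    """If all agents share the same error, voting gives no benefit at all."""
    return p


p = 0.2  # illustrative error rate, not a measured value
print(round(majority_wrong_independent(3, p), 3))  # 0.104
print(majority_wrong_fully_correlated(p))          # 0.2
```

Under these assumptions, three independent agents roughly halve the error rate, while fully correlated agents (for example, the same base model with similar prompts) agree on the wrong answer as often as a single agent would.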

