Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.65
Citation Count
8
Why It Matters For Business
Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time. Limits on scale and correctness mean human oversight is still required for complex or safety-critical work.
Summary TLDR
This paper surveys 71 recent papers on LLM-based multi-agent (LMA) systems in software engineering, runs two case studies with ChatDev (Snake and Tetris), and proposes a two-phase research agenda: (1) improve single-agent role skills and prompting; (2) optimize agent collaboration, scaling, privacy, and evaluation. The case studies show LMAs are fast and cheap for moderate tasks (Snake: 76 s, $0.019) but struggle with deeper logic (Tetris: success at the 10th run; row removal missing). The paper calls for new benchmarks, role-specific fine-tuning, agent-oriented prompting languages, and privacy controls.
Problem Statement
Single LLMs are limited for complex, multi-domain software engineering tasks. The field lacks a systematic map of LLM multi-agent work, realistic benchmarks that test collaboration, and engineering recipes for role specialization, scaling, privacy, and human-agent division of labor.
Main Contribution
Systematic review of 71 primary studies on LLM-based multi-agent systems for software engineering.
Two hands-on case studies using ChatDev (GPT-3.5-turbo) to evaluate practical strengths and limits.
A structured research agenda with two phases: enhance individual agents and optimize agent synergy (scaling, evaluation, privacy, human-agent collaboration).
Key Findings
Surveyed 71 recent primary studies on LMA in software engineering.
ChatDev produced a playable Snake game quickly and cheaply (76 s, $0.019).
ChatDev struggled on the more complex Tetris task, reaching success only at the 10th run.
Most current benchmarks and evaluations focus on isolated tasks, not multi-agent collaboration.
Key open gaps: role specialization, agent-oriented prompting languages, scaling, privacy, and dynamic adaptation.
Results
Snake game completion time
76 s
Snake game cost
$0.019
Tetris runs to functional result
10
Tetris cost
Who Should Care
What To Try In 7 Days
Run an LMA prototype (e.g., ChatDev) on a small, well-specified feature to measure time and cost.
Define 2–3 agent roles (planner, coder, tester) and run iterative cycles to see failure modes.
Audit available project data for privacy-sensitive content and prototype a simple access control before agent onboarding.
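The role-based cycle suggested above can be sketched in a few lines. This is a minimal illustration, not ChatDev's API: `run_cycle`, `RunReport`, and the three stub agents are hypothetical names, and the stubs stand in for real LLM calls so the loop can be run offline to see how iterations and failure messages accumulate.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RunReport:
    iteration_count: int = 0
    time_per_run: float = 0.0
    passed: bool = False
    log: List[str] = field(default_factory=list)


def run_cycle(
    plan: Callable[[str], List[str]],
    code: Callable[[List[str], str], str],
    test: Callable[[str], Tuple[bool, str]],
    task: str,
    max_iterations: int = 5,
) -> RunReport:
    """Plan once, then loop coder -> tester until the tester passes."""
    report = RunReport()
    start = time.perf_counter()
    steps = plan(task)
    feedback = ""
    for i in range(1, max_iterations + 1):
        report.iteration_count = i
        artifact = code(steps, feedback)
        ok, feedback = test(artifact)
        report.log.append(f"iteration {i}: {'pass' if ok else 'fail: ' + feedback}")
        if ok:
            report.passed = True
            break
    report.time_per_run = time.perf_counter() - start
    return report


# Hypothetical stub agents; replace each with a call to a real model.
def stub_planner(task: str) -> List[str]:
    return [f"implement {task}"]


def stub_coder(steps: List[str], feedback: str) -> str:
    # Assumed behaviour: the first draft omits row removal and the
    # coder adds it only after the tester reports it missing.
    return "draft+row_removal" if "row_removal" in feedback else "draft"


def stub_tester(artifact: str) -> Tuple[bool, str]:
    if "row_removal" in artifact:
        return True, ""
    return False, "missing row_removal"


report = run_cycle(stub_planner, stub_coder, stub_tester, "tetris")
print(report.iteration_count, report.passed)
for line in report.log:
    print(line)
```

Keeping the per-iteration log makes it easier to see which failure modes recur once the stubs are swapped for real agents.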
Agent Features
Memory
- Short-term memory (current session)
- Long-term memory (experience/history)
- Retrieval memory (repo or docs)
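As a rough illustration of how the three memory types differ, the sketch below keeps a bounded short-term buffer, moves evicted items into a long-term list, and does keyword-overlap retrieval over indexed documents. The class `AgentMemory` and its methods are hypothetical; a real system would more likely use embeddings for retrieval.

```python
from collections import deque
from typing import Dict, List


class AgentMemory:
    def __init__(self, short_term_size: int = 3) -> None:
        # Short-term: only the most recent messages of the current session.
        self.short_term: deque = deque(maxlen=short_term_size)
        # Long-term: everything that has aged out of the short-term buffer.
        self.long_term: List[str] = []
        # Retrieval: named documents (e.g. repo files or docs).
        self.documents: Dict[str, str] = {}

    def observe(self, message: str) -> None:
        if len(self.short_term) == self.short_term.maxlen:
            # The oldest item is about to be evicted; archive it first.
            self.long_term.append(self.short_term[0])
        self.short_term.append(message)

    def index(self, name: str, text: str) -> None:
        self.documents[name] = text

    def retrieve(self, query: str, top_k: int = 1) -> List[str]:
        # Assumed scoring: number of query words shared with the document.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), name)
            for name, text in self.documents.items()
        ]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [name for _, name in scored[:top_k]]


mem = AgentMemory(short_term_size=2)
for msg in ["a", "b", "c"]:
    mem.observe(msg)
mem.index("board.py", "grid rows clear full rows")
mem.index("snake.py", "snake moves and grows")
print(list(mem.short_term), mem.long_term, mem.retrieve("clear full rows"))
```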
Planning
- Centralized Planning
- Decentralized Planning
Tool Use
- Compiler and test tool integration
- Static analyzers
- Retrieval agents for repo search
Frameworks
- ChatDev
- MetaGPT
- AutoGen
- LangChain
Is Agentic
true
Architectures
- Orchestration platform + agents
- Hierarchical agent teams
- Centralized planning / Decentralized execution
Collaboration
- Agent Communication
- Multi-agent Coordination
- Debate and cross-validation
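One simple form of cross-validation is a majority vote over independent agent answers. The sketch below is an assumption about how that could look, not a method taken from the paper; `cross_validate` and the `quorum` parameter are hypothetical names.

```python
from collections import Counter
from typing import List, Optional, Tuple


def cross_validate(answers: List[str], quorum: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (winning answer or None, share of agents that agreed).

    An answer wins only if strictly more than `quorum` of the agents gave it.
    """
    if not answers:
        raise ValueError("need at least one answer")
    winner, votes = Counter(answers).most_common(1)[0]
    share = votes / len(answers)
    return (winner if share > quorum else None), share


print(cross_validate(["A", "A", "B"]))
print(cross_validate(["A", "B"]))
```

A high agreement share does not imply correctness: if the agents share the same model or prompt, their errors are correlated and a vote can converge on a wrong answer.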
Optimization Features
Token Efficiency
- Prompt design and summarization to reduce context size
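A minimal sketch of summarization-based context reduction: keep the newest messages that fit a token budget and replace everything older with one summary line. The names `compress_history` and `approx_tokens` are hypothetical, whitespace word count is only a crude stand-in for a real tokenizer, and the summarizer here is a placeholder for a model call.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is roughly one token.
    return len(text.split())


def compress_history(
    messages: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Keep the most recent messages within `budget`; summarize the rest.

    The summary line itself is not counted against the budget.
    """
    used = 0
    split = len(messages)
    for i in range(len(messages) - 1, -1, -1):
        cost = approx_tokens(messages[i])
        if used + cost > budget:
            break
        used += cost
        split = i
    older, recent = messages[:split], messages[split:]
    if not older:
        return recent
    return [summarize(older)] + recent


def placeholder_summary(older: List[str]) -> str:
    return f"[summary of {len(older)} earlier messages]"


history = ["a b c d", "e f g", "h i", "j"]
print(compress_history(history, budget=3, summarize=placeholder_summary))
```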
Infra Optimization
- Message prioritization to reduce communication overhead
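Message prioritization can be sketched with a priority queue that delivers only the most urgent messages each round. `MessageBus`, `Message`, and `budget_per_round` are hypothetical names introduced for illustration; lower numbers mean higher priority, and a sequence counter keeps ordering stable among equal priorities.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Message:
    priority: int
    seq: int
    sender: str = field(compare=False)
    body: str = field(compare=False)


class MessageBus:
    def __init__(self, budget_per_round: int) -> None:
        self._heap: List[Message] = []
        self._counter = itertools.count()
        self.budget_per_round = budget_per_round

    def post(self, sender: str, body: str, priority: int) -> None:
        heapq.heappush(self._heap, Message(priority, next(self._counter), sender, body))

    def deliver_round(self) -> List[Message]:
        """Pop at most `budget_per_round` messages, most urgent first."""
        delivered: List[Message] = []
        while self._heap and len(delivered) < self.budget_per_round:
            delivered.append(heapq.heappop(self._heap))
        return delivered


bus = MessageBus(budget_per_round=2)
bus.post("tester", "style nit in utils", priority=5)
bus.post("tester", "build is broken", priority=0)
bus.post("planner", "requirement changed", priority=1)
first_round = [m.body for m in bus.deliver_round()]
second_round = [m.body for m in bus.deliver_round()]
print(first_round, second_round)
```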
System Optimization
- Scale horizontally by adding agents
- Central knowledge repository to avoid inconsistencies
Training Optimization
- Parameter-efficient fine-tuning (PEFT) suggested
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Empirical validation is limited to two small case studies with ChatDev.
- No public release of experimental code or reproducible datasets.
- Benchmarks for multi-agent collaboration are missing, limiting objective comparison.
- Privacy, security, and industrial integration are discussed but not experimentally resolved.
When Not To Use
- For safety-critical systems without strict privacy and correctness guarantees.
- As a fully autonomous replacement on complex projects requiring deep abstraction.
- When project data cannot be shared or requires strict regulatory controls.
Failure Modes
- Hallucination or incorrect logic that passes casual review
- Consensus on wrong solution due to correlated agent errors
- Communication bottlenecks and information divergence with many agents
- Privacy leaks if data access is not controlled
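For the privacy failure mode, a first-pass audit can flag and redact obviously sensitive strings before any project data reaches an agent. The patterns below are illustrative assumptions and far from exhaustive; `audit` and `redact` are hypothetical helper names, and a real deployment would need proper access control in addition to scanning.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; extend these for real projects.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "secret_assignment": re.compile(
        r"(?i)\b(api[_-]?key|password|token)\b\s*[:=]\s*\S+"
    ),
}


def audit(text: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern label) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


def redact(text: str) -> str:
    """Replace every match with a placeholder such as <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text


sample = "contact: dev@example.com\napi_key = abc123\nprint('hello')"
print(audit(sample))
print(redact(sample))
```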
Core Entities
Models
- GPT-3.5-turbo
- ChatGPT
- LLaMA
- Claude
- Gemini
Metrics
- time_per_run
- cost_per_run
- iteration_count
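These three metrics can be computed from per-run records, as sketched below. The token prices are placeholder assumptions rather than real provider rates, the `Run` record is hypothetical, and `iteration_count` is interpreted here as the index of the first successful run, which the source does not define precisely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed prices in USD per 1,000 tokens; placeholders, not real rates.
PRICE_PER_1K_PROMPT = 0.001
PRICE_PER_1K_COMPLETION = 0.002


@dataclass
class Run:
    seconds: float
    prompt_tokens: int
    completion_tokens: int
    succeeded: bool


def cost_of(run: Run) -> float:
    return (
        run.prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
        + run.completion_tokens / 1000 * PRICE_PER_1K_COMPLETION
    )


def summarize_runs(runs: List[Run]) -> Dict[str, Optional[float]]:
    if not runs:
        raise ValueError("need at least one run")
    costs = [cost_of(r) for r in runs]
    first_success = next(
        (i for i, r in enumerate(runs, start=1) if r.succeeded), None
    )
    return {
        "time_per_run": sum(r.seconds for r in runs) / len(runs),
        "cost_per_run": sum(costs) / len(costs),
        "iteration_count": first_success,
        "total_cost": sum(costs),
    }


runs = [
    Run(seconds=60, prompt_tokens=3000, completion_tokens=1000, succeeded=False),
    Run(seconds=80, prompt_tokens=5000, completion_tokens=3000, succeeded=True),
]
stats = summarize_runs(runs)
print(stats)
```

Reporting the total alongside the per-run average matters when a task needs many attempts, since a cheap single run can hide a much larger cost to reach a working result.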
Context Entities
Models
- GPT-4 (referenced)
- Codex (context)
Metrics
- consensus_score
- communication_efficiency
Datasets
- historical project logs (referenced as source for experiential co-learning)
Benchmarks
- LiveCodeBench
- BigCodeBench

