Survey and roadmap for LLM-based multi-agent systems applied to software engineering

April 7, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.65

Citation Count

8

Authors

Junda He, Christoph Treude, David Lo

Why It Matters For Business

Multi-agent LLM systems can automate and speed up routine engineering tasks, lowering prototyping cost and time. Limits on scale and correctness mean human oversight is still required for complex or safety-critical work.

Summary TLDR

This paper surveys 71 recent papers on LLM-based multi-agent (LMA) systems in software engineering, runs two case studies with ChatDev (Snake and Tetris), and proposes a two-phase research agenda: (1) improve single-agent role skills and prompting, (2) optimize agent collaboration, scaling, privacy, and evaluation. The case studies show LMA systems are fast and cheap on moderately complex tasks (Snake: 76 s and $0.019 per run on average) but struggle with deeper logic (Tetris: first functional game on the 10th run, still missing row-removal logic). The paper calls for new benchmarks, role-specific fine-tuning, agent-oriented prompting languages, and privacy controls.

Problem Statement

Single LLMs are limited for complex, multi-domain software engineering tasks. The field lacks a systematic map of LLM multi-agent work, realistic benchmarks that test collaboration, and engineering recipes for role specialization, scaling, privacy, and human-agent division of labor.

Main Contribution

Systematic review of 71 primary studies on LLM-based multi-agent systems for software engineering.

Two hands-on case studies using ChatDev (GPT-3.5-turbo) to evaluate practical strengths and limits.

A structured research agenda with two phases: enhance individual agents and optimize agent synergy (scaling, evaluation, privacy, human-agent collaboration).

Key Findings

Surveyed 71 recent primary studies on LMA in software engineering.

Numbers: 71 primary studies (41 identified initially, plus 30 via snowballing)

ChatDev produced a playable Snake game quickly and cheaply.

Numbers: Snake: avg 76 seconds, $0.019 per run

ChatDev struggled on a more complex task and required many attempts.

Numbers: Tetris: success only on the 10th attempt; avg 70 seconds, $0.020 per run; row-removal logic still missing

Most current benchmarks and evaluations focus on isolated tasks, not multi-agent collaboration.

Numbers: Benchmarks mainly target single-task code generation (multiple references); few cover multi-agent collaboration

Key open gaps: role specialization, agent-oriented prompting languages, scaling, privacy, and dynamic adaptation.

Numbers: Identified as core research agenda items across both phases (multiple sections)

Results

Snake game completion time

Value: 76 seconds (avg)

Snake game cost

Value: $0.019 per run (avg)

Tetris runs to functional result

Value: 10 runs until partial success

Tetris cost

Value: $0.020 per run (avg)
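
The per-run averages understate what a usable result costs when several attempts are needed. A back-of-the-envelope estimate for Tetris follows, assuming every attempt took roughly the reported average time and cost. The summary gives per-run averages only, so the totals are an extrapolation, not a reported figure, and the function name is illustrative.

```python
# Rough totals derived from the reported per-run averages.
# Assumption: each attempt costs about the average; totals are not reported in the source.

def cost_to_first_success(runs: int, avg_seconds: float, avg_dollars: float) -> tuple[float, float]:
    """Return (total_seconds, total_dollars) spent across all attempts."""
    return runs * avg_seconds, runs * avg_dollars

# Tetris: 10 attempts at an average of 70 s and $0.020 each.
tetris_seconds, tetris_dollars = cost_to_first_success(runs=10, avg_seconds=70, avg_dollars=0.020)
print(f"Tetris: ~{tetris_seconds:.0f} s and ~${tetris_dollars:.2f} across 10 attempts")
```

Under that assumption the partially working Tetris game cost on the order of 700 seconds and $0.20 in total, still small in absolute terms but about ten times the per-run figure.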

Who Should Care

What To Try In 7 Days

Run an LMA prototype (e.g., ChatDev) on a small, well-specified feature to measure time and cost.

Define 2–3 agent roles (planner, coder, tester) and run iterative cycles to see failure modes.

Audit available project data for privacy-sensitive content and prototype a simple access control before agent onboarding.
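
The role-based cycle suggested above (planner, coder, tester) can be sketched as a short loop that also records iteration count and wall-clock time. This is a minimal sketch and not ChatDev's implementation: the `LLM` type, `run_cycle`, `RunLog`, and the stub agents are hypothetical names, and the stubs stand in for real model calls so the example runs offline.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Assumption: an "agent" is any function from prompt text to reply text.
LLM = Callable[[str], str]

@dataclass
class RunLog:
    iterations: int = 0
    seconds: float = 0.0
    failures: list[str] = field(default_factory=list)  # tester feedback per failed round

def run_cycle(task: str, planner: LLM, coder: LLM, tester: LLM,
              max_iters: int = 3) -> tuple[str, RunLog]:
    """Plan once, then alternate coder and tester until the tester replies PASS."""
    log = RunLog()
    start = time.perf_counter()
    plan = planner(f"Plan the task: {task}")
    code, feedback = "", ""
    for _ in range(max_iters):
        log.iterations += 1
        code = coder(f"Plan:\n{plan}\nFeedback:\n{feedback}\nWrite the code.")
        verdict = tester(f"Review this code:\n{code}")
        if verdict.strip().upper().startswith("PASS"):
            break
        feedback = verdict            # feed the failure back to the coder
        log.failures.append(verdict)  # keep it for failure-mode analysis
    log.seconds = time.perf_counter() - start
    return code, log

# Stub agents (hypothetical): the second draft passes review.
drafts = {"n": 0}

def stub_planner(prompt: str) -> str:
    return "1. draw grid 2. move snake"

def stub_coder(prompt: str) -> str:
    drafts["n"] += 1
    return f"# draft {drafts['n']}"

def stub_tester(prompt: str) -> str:
    return "PASS" if "draft 2" in prompt else "FAIL: missing collision check"

code, log = run_cycle("snake game", stub_planner, stub_coder, stub_tester)
print(code, log.iterations, log.failures)
```

Replacing the stubs with real model calls and adding a cost field to `RunLog` would yield the time, cost, and iteration figures the first step asks for.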

Agent Features

Memory

  • Short-term memory (current session)
  • Long-term memory (experience/history)
  • Retrieval memory (repo or docs)
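
One way to picture the three memory types is a small class with a bounded session buffer, an unbounded history, and a keyword lookup over documents. This is an illustrative sketch only: `AgentMemory` and its methods are hypothetical, and word overlap stands in for whatever retrieval method a real system would use.

```python
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 3):
        self.short_term = deque(maxlen=short_term_size)  # current session, oldest items drop out
        self.long_term: list[str] = []                   # full experience/history
        self.documents: dict[str, str] = {}              # repo or docs, keyed by name

    def observe(self, message: str) -> None:
        """Record a message in both the session buffer and the history."""
        self.short_term.append(message)
        self.long_term.append(message)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        """Return up to k document names ranked by word overlap with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(text.lower().split())), name)
                  for name, text in self.documents.items()]
        scored = [(score, name) for score, name in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:k]]

mem = AgentMemory(short_term_size=2)
for msg in ["plan created", "draft written", "tests failed"]:
    mem.observe(msg)
mem.documents = {
    "board.py": "tetris board grid row removal",
    "snake.py": "snake movement and food",
}
print(list(mem.short_term), len(mem.long_term), mem.retrieve("row removal logic"))
```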

Planning

  • Centralized Planning
  • Decentralized Planning

Tool Use

  • Compiler and test tool integration
  • Static analyzers
  • Retrieval agents for repo search

Frameworks

  • ChatDev
  • MetaGPT
  • AutoGen
  • LangChain

Is Agentic

true

Architectures

  • Orchestration platform + agents
  • Hierarchical agent teams
  • Centralized planning / Decentralized execution

Collaboration

  • Agent Communication
  • Multi-agent Coordination
  • Debate and cross-validation
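
Cross-validation among agents is often reduced to a vote over independent answers. The sketch below shows that reduction in its simplest form. `majority_answer` is a hypothetical helper, not an API of any framework listed here, and exact string matching is a simplifying assumption.

```python
from collections import Counter

def majority_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of agents that gave it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]  # ties resolve to the answer seen first
    return answer, votes / len(answers)

# Three agents propose how to handle a completed Tetris row.
answer, share = majority_answer([
    "clear full rows",
    "clear full rows",
    "ignore full rows",
])
print(answer, round(share, 2))
```

A high agreement share is not evidence of correctness on its own. If all agents share the same underlying model, their errors tend to be correlated and the vote can settle on a wrong answer.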

Optimization Features

Token Efficiency

  • Prompt design and summarization to reduce context size
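
Reducing context size can be as simple as keeping only the most recent messages that fit a budget and marking what was dropped. The sketch below assumes whitespace-separated words as a stand-in for tokens. `fit_context` is a hypothetical helper, and a real system would likely replace the marker line with a model-written summary.

```python
def fit_context(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined word count fits the budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = len(msg.split())      # assumption: one word is roughly one token
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    dropped = len(messages) - len(kept)
    kept.reverse()                   # restore chronological order
    if dropped:
        # The marker is not counted against the budget in this sketch.
        kept.insert(0, f"[omitted earlier messages: {dropped}]")
    return kept

history = ["plan the snake game", "use a grid of cells", "add collision checks"]
print(fit_context(history, budget=8))
```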

Infra Optimization

  • Message prioritization to reduce communication overhead
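
Message prioritization can be illustrated with a priority queue in which urgent messages are delivered first and equal priorities keep arrival order. This is a sketch using Python's standard `heapq`. The `MessageQueue` class and the priority values are assumptions, not part of any surveyed framework.

```python
import heapq
import itertools

class MessageQueue:
    """Deliver lower priority numbers first; ties keep arrival order."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str, str]] = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO within a priority

    def push(self, priority: int, sender: str, text: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), sender, text))

    def pop(self) -> tuple[str, str]:
        _, _, sender, text = heapq.heappop(self._heap)
        return sender, text

    def __len__(self) -> int:
        return len(self._heap)

queue = MessageQueue()
queue.push(2, "coder", "refactor note")
queue.push(0, "tester", "build is failing")
queue.push(1, "planner", "scope change")

order = [queue.pop()[0] for _ in range(len(queue))]
print(order)
```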

System Optimization

  • Scale horizontally by adding agents
  • Central knowledge repository to avoid inconsistencies

Training Optimization

  • Parameter-efficient fine-tuning (PEFT) suggested

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Empirical validation is limited to two small case studies with ChatDev.
  • No public release of experimental code or reproducible datasets.
  • Benchmarks for multi-agent collaboration are missing, limiting objective comparison.
  • Privacy, security, and industrial integration are discussed but not experimentally resolved.

When Not To Use

  • For safety-critical systems without strict privacy and correctness guarantees.
  • As a fully autonomous replacement on complex projects requiring deep abstraction.
  • When project data cannot be shared or requires strict regulatory controls.

Failure Modes

  • Hallucination or incorrect logic that passes casual review
  • Consensus on wrong solution due to correlated agent errors
  • Communication bottlenecks and information divergence with many agents
  • Privacy leaks if data access is not controlled
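
For the privacy failure mode, a first line of defense is to filter which files agents may read at all. The sketch below uses the standard `fnmatch` module. The deny patterns and the `visible_files` helper are hypothetical examples, not a complete access-control scheme.

```python
from fnmatch import fnmatch

# Hypothetical deny list; adapt the patterns to the project's own sensitive paths.
DENY_PATTERNS = ["*.env", "secrets/*", "*.pem"]

def visible_files(paths: list[str], deny: list[str] = DENY_PATTERNS) -> list[str]:
    """Return only the paths that match none of the deny patterns."""
    return [p for p in paths if not any(fnmatch(p, pattern) for pattern in deny)]

allowed = visible_files(["src/game.py", ".env", "secrets/api.txt", "README.md"])
print(allowed)
```

Path filtering does not catch sensitive content embedded in otherwise permitted files, so it complements a content audit and does not replace one.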

Core Entities

Models

  • GPT-3.5-turbo
  • ChatGPT
  • LLaMA
  • Claude
  • Gemini

Metrics

  • time_per_run
  • cost_per_run
  • iteration_count

Context Entities

Models

  • GPT-4 (referenced)
  • Codex (context)

Metrics

  • consensus_score
  • communication_efficiency

Datasets

  • historical project logs (referenced as source for experiential co-learning)

Benchmarks

  • LiveCodeBench
  • BigCodeBench