ChatDev: multi-agent LLMs that chat to design, code, and test software

July 16, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The system shows clear gains on averaged SRDD tasks and ablations, but it targets prototypes: agents still need detailed requirements and cost more tokens and time than single-agent setups.

Citations: 69

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun


Why It Matters For Business

ChatDev makes prototyping software faster and more reliable by combining role-based LLM agents into a chained workflow that raises the chance code runs without heavy manual fixes.

Summary TLDR

ChatDev is a multi-agent framework that uses chat-style, role-based LLM agents to build software through three chained phases: design, coding, and testing. It adds a chat chain to divide tasks and a "communicative dehallucination" pattern where assistants ask clarifying questions to reduce coding errors. On a 1,200-task dataset (SRDD) ChatDev outperforms single- and other multi-agent baselines on completeness, executability, and overall quality, but it needs clear requirements and uses more tokens/time than single-agent methods.
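The chained workflow can be pictured as a loop over phases, each run by an instructor/assistant pair whose result is handed to the next phase. The following is a minimal sketch, not ChatDev's actual code: the `Phase` dataclass, `run_chain`, the role names, and the stubbed `assistant_reply` callables are all hypothetical, and a real system would call an LLM where the stubs return canned text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    name: str        # "design", "coding", or "testing"
    instructor: str  # hypothetical role that issues the instruction
    assistant: str   # hypothetical role that proposes the solution
    # Stub standing in for an LLM call: (instruction, prior_solution) -> solution
    assistant_reply: Callable[[str, str], str]


def run_chain(task: str, phases: List[Phase]) -> List[str]:
    """Run the phases in order, handing each phase's solution to the next."""
    solution = task
    log = []
    for phase in phases:
        instruction = f"[{phase.instructor} -> {phase.assistant}] {phase.name}: {solution}"
        solution = phase.assistant_reply(instruction, solution)
        log.append(f"{phase.name}: {solution}")
    return log


# Illustrative role names and canned replies (assumptions, not from the source).
phases = [
    Phase("design", "CEO", "CTO", lambda ins, prev: f"plan for ({prev})"),
    Phase("coding", "CTO", "Programmer", lambda ins, prev: f"code for ({prev})"),
    Phase("testing", "Programmer", "Tester", lambda ins, prev: f"tested ({prev})"),
]
chain_log = run_chain("todo-list app", phases)
```

Each phase sees only the previous phase's solution rather than its full dialog, which is one simple way to keep the handoff narrow.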

Problem Statement

Software development involves multiple phases (design, coding, testing) and diverse roles. Prior ML work focuses on single phases with bespoke models, creating technical fragmentation. LLMs can play roles but tend to hallucinate in code (incomplete or unexecutable outputs). The paper asks: can a unified, language-based multi-agent system reliably produce more complete and executable software while reducing coding hallucinations?

Main Contribution

ChatDev: a chat-powered multi-agent framework that chains design, coding, and testing into sequential subtasks and uses role-based instructor/assistant pairs.

Communicative dehallucination: a dialog pattern where assistants proactively ask for specifics to avoid coding hallucinations.
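The dehallucination pattern can be read as a small protocol: before writing anything, the assistant may ask the instructor for a missing detail, and only then commits to a solution. The sketch below illustrates that idea under stated assumptions; `dehallucinated_exchange`, its callable parameters, the `max_questions` cap, and the stub functions are hypothetical names, and the stubs stand in for LLM calls.

```python
from typing import Callable, List, Optional, Tuple


def dehallucinated_exchange(
    instruction: str,
    ask_clarification: Callable[[str], Optional[str]],
    answer_question: Callable[[str], str],
    write_solution: Callable[[str], str],
    max_questions: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Let the assistant ask for specifics before it commits to a solution."""
    context = instruction
    transcript = []
    for _ in range(max_questions):
        question = ask_clarification(context)  # None means "nothing unclear"
        if question is None:
            break
        reply = answer_question(question)      # instructor supplies the detail
        transcript.append((question, reply))
        context += f"\nQ: {question}\nA: {reply}"
    return write_solution(context), transcript


# Stubs standing in for LLM calls (purely illustrative).
def ask(ctx: str) -> Optional[str]:
    return None if "dependency" in ctx else "Which dependency should the GUI use?"


def answer(question: str) -> str:
    return "Use the tkinter dependency from the standard library."


def solve(ctx: str) -> str:
    return "import tkinter" if "tkinter" in ctx else "import unknown_gui_lib"


solution, transcript = dehallucinated_exchange("Build a GUI calculator.", ask, answer, solve)
```

Without the clarifying question, the stubbed assistant would have guessed a nonexistent library, which is the kind of error the pattern is meant to reduce.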

Key Findings

ChatDev generates more runnable software than baselines.

Numbers: Executability: ChatDev 0.88 vs GPT-Engineer 0.3583, MetaGPT 0.4145

Practical Use: Use multi-agent chat chains and testing phases to sharply increase the chance generated projects run out-of-the-box.

Evidence Ref: Table 1

Overall software quality (product of metrics) improved materially.

Numbers: Quality: ChatDev 0.3953 vs MetaGPT 0.1523 (approx. +0.24)

Practical Use: Chaining design→code→test with role-based agents produces more complete, consistent, and executable software on evaluated tasks.

Evidence Ref: Table 1
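Because quality is reported as the product of completeness, executability, and consistency, the figures quoted on this page imply a consistency score even though none is listed here. The short calculation below is a sketch of that arithmetic; the implied value is derived from the rounded numbers above and is not quoted from the paper.

```python
# ChatDev scores as quoted on this page.
completeness = 0.5600
executability = 0.8800
quality = 0.3953

# quality = completeness * executability * consistency, so solve for consistency.
implied_consistency = quality / (completeness * executability)

# Gap to the MetaGPT quality score quoted above.
quality_gap = quality - 0.1523

print(f"implied consistency ~ {implied_consistency:.2f}")  # prints 0.80
print(f"quality gap vs MetaGPT ~ {quality_gap:.2f}")        # prints 0.24
```

A product metric is strict: a low score on any one factor drags the overall quality down, so the executability gain accounts for much of ChatDev's lead.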

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Completeness | 0.5600 | GPT-Engineer 0.5022; MetaGPT 0.4834 | ≈+0.06 vs GPT-Engineer | SRDD (averaged over 1,200 tasks) | Table 1 reports averaged completeness across tasks | Table 1
Executability | 0.8800 | GPT-Engineer 0.3583; MetaGPT 0.4145 | +0.5217 vs GPT-Engineer | SRDD (averaged over 1,200 tasks) | Portion of generated projects that compile and run | Table 1
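The table lists deltas against GPT-Engineer only. As a quick check of that arithmetic, and to see the gap to MetaGPT as well, the deltas can be recomputed from the listed values; the dictionary layout below is just one convenient, hypothetical way to hold them.

```python
# Scores copied from the results table above.
scores = {
    "Completeness": {"ChatDev": 0.5600, "GPT-Engineer": 0.5022, "MetaGPT": 0.4834},
    "Executability": {"ChatDev": 0.8800, "GPT-Engineer": 0.3583, "MetaGPT": 0.4145},
}

# Delta = ChatDev score minus each baseline's score, rounded to 4 decimals.
deltas = {
    metric: {
        name: round(values["ChatDev"] - value, 4)
        for name, value in values.items()
        if name != "ChatDev"
    }
    for metric, values in scores.items()
}

for metric, by_baseline in deltas.items():
    print(metric, by_baseline)
```

The executability gap is roughly an order of magnitude larger than the completeness gap against either baseline.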

What To Try In 7 Days

Run the ChatDev repo on 5 small prompts from SRDD to compare outputs with your current pipeline

Implement instructor/assistant role prompts and measure executability on simple prototypes

Add a clarifying-question step (dehallucination) before code commits to cut the amount of unexecutable code that reaches CI tests
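To measure executability on your own prototypes, one rough proxy is the fraction of generated entry-point scripts that exit cleanly. The sketch below assumes that proxy (exit status 0 within a timeout); it is not necessarily how the paper scores executability, and the function name `executability` and the `timeout_s` parameter are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Iterable


def executability(entry_points: Iterable[Path], timeout_s: float = 10.0) -> float:
    """Fraction of scripts that exit with status 0 within the timeout."""
    results = []
    for script in entry_points:
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
            results.append(proc.returncode == 0)
        except subprocess.TimeoutExpired:
            # A hang counts as a failure under this proxy.
            results.append(False)
    return sum(results) / len(results) if results else 0.0


# Self-check with two throwaway scripts: one runs, one has a bad import.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    bad = Path(tmp) / "bad.py"
    good.write_text("print('ok')\n")
    bad.write_text("import module_that_does_not_exist_xyz\n")
    rate = executability([good, bad])
```

Interactive or GUI programs never exit on their own, so they would need a headless smoke-test mode before this proxy says anything useful about them. Generated code should also be run in a sandbox rather than directly on a developer machine.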

Agent Features

Memory
Short-term memory per phase (dialog continuity); Long-term memory as saved subtask solutions
Planning
Multi-turn planning via chat chain
Tool Use
Python runtime for compile/run feedback; Inception prompting to seed dialog
Frameworks
Chat chain; Communicative dehallucination; Inception prompting
Is Agentic

Yes

Architectures
LLM-powered agents (ChatGPT-3.5)
Collaboration
Paired instructor/assistant roles; Chain-structured phase-to-phase handoff
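The two memory types listed under Agent Features can be modeled as a pair of lists: dialog turns that live only for the current phase, and solutions that persist across phases. This is a minimal sketch of that split; the `ChainMemory` class and its method names are hypothetical and not taken from the ChatDev code base.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChainMemory:
    # Dialog turns within the current phase (short-term).
    short_term: List[str] = field(default_factory=list)
    # One saved solution per finished phase (long-term).
    long_term: List[str] = field(default_factory=list)

    def say(self, turn: str) -> None:
        """Record a dialog turn in the current phase."""
        self.short_term.append(turn)

    def close_phase(self, solution: str) -> None:
        """Keep only the agreed solution and drop the phase's dialog."""
        self.long_term.append(solution)
        self.short_term.clear()


memory = ChainMemory()
memory.say("Instructor: pick a language")
memory.say("Assistant: Python")
memory.close_phase("language=Python")
memory.say("Instructor: write main.py")
```

Carrying forward only the solutions keeps the context passed to later phases small, at the price of losing the reasoning that led to each decision.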

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Agents tend to implement simple logic; unclear requirements yield low-information outputs.

The current system is better suited to prototypes than to complex, production-grade systems.

When Not To Use

For complex, safety-critical production systems without human oversight

When requirements are vague or underspecified

Failure Modes

Coding hallucinations: incomplete or unexecutable code

Missing imports or 'method not implemented' placeholders
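Both failure modes can be caught cheaply with static checks before any code is run. The sketch below uses Python's standard `ast` module to flag functions whose body is only `pass` or `raise NotImplementedError`, and names that are used but never imported or defined. It is a crude, scope-insensitive heuristic under stated assumptions, not ChatDev's own mechanism; `find_red_flags` and `_is_placeholder` are hypothetical helper names.

```python
import ast
import builtins
from typing import List


def _is_placeholder(func: ast.AST) -> bool:
    """True if the function body is a lone `pass` or `raise NotImplementedError`."""
    if len(func.body) != 1:
        return False
    stmt = func.body[0]
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Raise) and stmt.exc is not None:
        exc = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
        return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"
    return False


def find_red_flags(source: str) -> List[str]:
    """Report placeholder functions and names that are used but never bound."""
    tree = ast.parse(source)
    bound = set(dir(builtins))  # names every module can already see
    used = set()
    flags = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
            if not isinstance(node, ast.ClassDef) and _is_placeholder(node):
                flags.append(f"placeholder: {node.name}")
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            (used if isinstance(node.ctx, ast.Load) else bound).add(node.id)
    for name in sorted(used - bound):
        flags.append(f"possibly missing import: {name}")
    return flags


sample = '''
def area(r):
    return math.pi * r * r

def save(path):
    raise NotImplementedError("todo")
'''
flags = find_red_flags(sample)
```

Abstract base classes legitimately raise `NotImplementedError`, so in practice the placeholder check would need an allowlist or a human review step.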

Core Entities

Models

ChatGPT-3.5 (used as agent model); GPT-4 (used as automatic evaluator)

Metrics

Completeness; Executability; Consistency; Quality (product of three metrics)

Datasets

SRDD (Software Requirement Description Dataset, 1,200 prompts)