ChatDev: multi-agent LLMs that chat to design, code, and test software

Overview

Decision Snapshot: Needs Validation

The system shows clear gains on averaged SRDD tasks and ablations, but it targets prototypes: agents still need detailed requirements and cost more tokens and time than single-agent setups.

Citations: 69

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ChatDev targets more reliable software prototyping: it combines role-based LLM agents into a chained workflow that raises the chance generated code runs without heavy manual fixes, at the cost of more tokens and time than single-agent setups.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

ChatDev is a multi-agent framework that uses chat-style, role-based LLM agents to build software through three chained phases: design, coding, and testing. It adds a chat chain to divide tasks and a "communicative dehallucination" pattern in which assistants ask clarifying questions to reduce coding errors. On SRDD, a 1,200-task dataset, ChatDev outperforms single-agent and other multi-agent baselines on completeness, executability, and overall quality, but it needs clear requirements and uses more tokens and time than single-agent methods.

Problem Statement

Software development involves multiple phases (design, coding, testing) and diverse roles. Prior ML work focuses on single phases with bespoke models, creating technical fragmentation. LLMs can play roles but tend to hallucinate in code (incomplete or unexecutable outputs). The paper asks: can a unified, language-based multi-agent system reliably produce more complete and executable software while reducing coding hallucinations?

Main Contribution

ChatDev: a chat-powered multi-agent framework that chains design, coding, and testing into sequential subtasks and uses role-based instructor/assistant pairs.

Communicative dehallucination: a dialog pattern where assistants proactively ask for specifics to avoid coding hallucinations.
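
The dialog pattern above can be sketched as a small loop. This is a minimal illustration, not ChatDev's actual implementation: the `Agent` type, the `QUESTION:` marker, the `max_questions` budget, and both stub agents are assumptions made for the example, and a real setup would replace the stubs with LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical agent type: takes a prompt string, returns a reply string.
Agent = Callable[[str], str]

# Assumed convention: a reply starting with this marker is a clarifying question.
QUESTION_PREFIX = "QUESTION:"


def dehallucinated_exchange(
    instruction: str,
    instructor: Agent,
    assistant: Agent,
    max_questions: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Let the assistant ask for specifics before it commits to an answer."""
    clarifications: List[Tuple[str, str]] = []
    context = instruction
    for _ in range(max_questions):
        reply = assistant(context)
        if not reply.startswith(QUESTION_PREFIX):
            return reply, clarifications  # assistant answered directly
        question = reply[len(QUESTION_PREFIX):].strip()
        answer = instructor(question)  # roles flip: instructor supplies details
        clarifications.append((question, answer))
        context += f"\nQ: {question}\nA: {answer}"
    # Question budget used up: request a final answer with what is known.
    final = assistant(context + "\nAnswer now without further questions.")
    return final, clarifications


# Stub agents standing in for LLM calls (illustrative only).
def stub_instructor(question: str) -> str:
    return "Use a 3x3 board stored as a list of lists."


def stub_assistant(context: str) -> str:
    if "A:" not in context:
        return "QUESTION: How should the board be represented?"
    return "board = [[None] * 3 for _ in range(3)]"


solution, log = dehallucinated_exchange(
    "Implement the tic-tac-toe board.", stub_instructor, stub_assistant
)
print(solution)
print(log)
```

Capping the number of questions keeps the exchange from looping indefinitely when an assistant never commits to an answer.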

Key Findings

ChatDev generates more runnable software than baselines.

Numbers: Executability is 0.88 for ChatDev vs 0.3583 for GPT-Engineer and 0.4145 for MetaGPT.

Practical Use: Use multi-agent chat chains and testing phases to sharply increase the chance that generated projects run out of the box.

Evidence Ref: Table 1

Overall software quality (product of metrics) improved materially.

Numbers: Quality is 0.3953 for ChatDev vs 0.1523 for MetaGPT (approx. +0.24).

Practical Use: Chaining design→code→test with role-based agents produces more complete, consistent, and executable software on the evaluated tasks.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Completeness	0.5600	GPT-Engineer 0.5022; MetaGPT 0.4834	≈+0.06 vs GPT-Engineer	SRDD (averaged over 1,200 tasks)	Table 1 reports averaged completeness across tasks	Table 1
Executability	0.8800	GPT-Engineer 0.3583; MetaGPT 0.4145	+0.5217 vs GPT-Engineer	SRDD (averaged over 1,200 tasks)	Portion of generated projects that compile and run	Table 1

What To Try In 7 Days

Run the ChatDev repo on 5 small prompts from SRDD to compare outputs with your current pipeline

Implement instructor/assistant role prompts and measure executability on simple prototypes

Add a clarifying-question step (dehallucination) before code commits to reduce unexecutable code runs in CI tests
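
To measure executability on your own prototypes, as the second step suggests, one rough approach is to run each project's entry point and count clean exits. This is a sketch under stated assumptions, not the paper's evaluation harness: the helper names, the single-script entry point, and the 10-second timeout are all illustrative choices. Programs that wait for input or open a GUI will hit the timeout and count as failures.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable


def runs_cleanly(entry_point: Path, timeout_s: float = 10.0) -> bool:
    """Return True if the script exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, str(entry_point)],
            capture_output=True,  # keep child output out of our own stdout
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def executability(entry_points: Iterable[Path]) -> float:
    """Fraction of entry points that run cleanly (0.0 for an empty input)."""
    outcomes = [runs_cleanly(path) for path in entry_points]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A call such as `executability(Path("outputs").glob("*/main.py"))` would score a directory of generated projects; the `outputs` layout and `main.py` name are assumptions about how you store them.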

Agent Features

Memory

Short-term memory per phase (dialog continuity)

Long-term memory as saved subtask solutions

Planning

Multi-turn planning via chat chain

Tool Use

Python runtime for compile/run feedback

Inception prompting to seed dialog

Frameworks

Chat chain

Communicative dehallucination

Inception prompting

Is Agentic

Yes

Architectures

LLM-powered agents (ChatGPT-3.5)

Collaboration

Paired instructor/assistant roles

Chain-structured phase-to-phase handoff
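
The paired roles, phase-to-phase handoff, and two memory scopes listed above can be sketched together. This is a simplified illustration rather than ChatDev's code: `Phase`, `ChainResult`, `run_chat_chain`, and the stub agents are hypothetical names, and the assumption that only each phase's final solution is passed forward is this sketch's reading of "long-term memory as saved subtask solutions".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]  # hypothetical: prompt in, reply out


@dataclass
class Phase:
    name: str
    instructor: Agent
    assistant: Agent
    turns: int = 1  # instructor/assistant rounds within the phase


@dataclass
class ChainResult:
    solutions: Dict[str, str] = field(default_factory=dict)  # long-term memory
    transcripts: Dict[str, List[str]] = field(default_factory=dict)


def run_chat_chain(task: str, phases: List[Phase]) -> ChainResult:
    result = ChainResult()
    for phase in phases:
        history: List[str] = []  # short-term memory, reset for every phase
        # Only earlier solutions, not whole dialogs, reach the next phase.
        handoff = "\n".join(f"[{n}] {s}" for n, s in result.solutions.items())
        solution = ""
        for _ in range(phase.turns):
            context = "\n".join([task, handoff, *history])
            instruction = phase.instructor(context)
            history.append(f"instructor: {instruction}")
            solution = phase.assistant("\n".join([context, instruction]))
            history.append(f"assistant: {solution}")
        result.solutions[phase.name] = solution
        result.transcripts[phase.name] = history
    return result


def stub_pair(phase_name: str) -> Tuple[Agent, Agent]:
    """Stub agents standing in for LLM calls (illustrative only)."""
    def instructor(ctx: str) -> str:
        return f"Please handle {phase_name}."

    def assistant(ctx: str) -> str:
        return f"{phase_name} artifact"

    return instructor, assistant


phases = [Phase(name, *stub_pair(name)) for name in ("design", "coding", "testing")]
result = run_chat_chain("Build a to-do list app.", phases)
print(result.solutions)
```

Keeping per-phase dialog out of the handoff bounds the context each later phase has to read, at the price of dropping any reasoning that never made it into the saved solution.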

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/OpenBMB/ChatDev

Data URLs

https://github.com/OpenBMB/ChatDev

Risks & Boundaries

Limitations

Agents tend to implement simple logic; unclear requirements yield low-information outputs.

Current system is more suited to prototypes than complex, production-grade systems.

When Not To Use

For complex, safety-critical production systems without human oversight

When requirements are vague or underspecified

Failure Modes

Coding hallucinations: incomplete or unexecutable code

Missing imports or 'method not implemented' placeholders
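
Placeholder methods of the kind described above can be flagged statically before any code is run. The following is a hedged sketch using Python's standard `ast` module; it is not part of ChatDev, `find_placeholders` is a hypothetical helper, and it covers only stub bodies (`pass`, `...`, a lone docstring, or `raise NotImplementedError`). Missing imports would still need a runtime or linter check.

```python
import ast
from typing import List


def find_placeholders(source: str) -> List[str]:
    """Names of functions whose body contains only stub statements."""

    def is_stub_stmt(stmt: ast.stmt) -> bool:
        if isinstance(stmt, ast.Pass):
            return True
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            return True  # a docstring or a bare `...`
        if isinstance(stmt, ast.Raise) and stmt.exc is not None:
            # Handle both `raise NotImplementedError` and the called form.
            target = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
            return isinstance(target, ast.Name) and target.id == "NotImplementedError"
        return False

    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and all(is_stub_stmt(stmt) for stmt in node.body)
    ]


sample = '''
class Game:
    def start(self):
        self.running = True

    def save(self):
        pass

    def load(self):
        raise NotImplementedError("todo")
'''

print(sorted(find_placeholders(sample)))
```

A check like this could gate the testing phase or a CI job, so that stubbed-out methods are sent back for completion instead of being counted as finished code.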

Core Entities

Models

ChatGPT-3.5 (used as agent model)

GPT-4 (used as automatic evaluator)

Metrics

Completeness

Executability

Consistency

Quality (product of three metrics)
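
Because quality is defined as the product of the three metrics, the figures quoted in this summary can be cross-checked with a few lines of arithmetic. The consistency score is not quoted here, so the value below is only what the reported completeness, executability, and quality imply (roughly 0.80), not a number taken from the paper's table.

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Overall quality as the product of the three component metrics."""
    return completeness * executability * consistency


# ChatDev figures as quoted in this summary (Table 1).
chatdev_completeness = 0.5600
chatdev_executability = 0.8800
chatdev_quality = 0.3953

# Consistency implied by the product definition; derived, not reported here.
implied_consistency = chatdev_quality / (chatdev_completeness * chatdev_executability)
print(f"implied consistency: {implied_consistency:.2f}")

# Quality gap against MetaGPT, matching the "approx. +0.24" quoted above.
metagpt_quality = 0.1523
print(f"quality delta vs MetaGPT: {chatdev_quality - metagpt_quality:+.2f}")
```

A product penalizes any single weak metric heavily: a project that is complete and consistent but never runs scores zero overall.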

Datasets

SRDD (Software Requirement Description Dataset, 1,200 prompts)
