ChatDev: multi-agent LLMs that chat to design, code, and test software

July 16, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.3

Citation Count

69

Authors

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun

Why It Matters For Business

ChatDev can speed up software prototyping and make it more reliable: role-based LLM agents work through a chained workflow, which raises the chance that generated code runs without heavy manual fixes.

Summary TLDR

ChatDev is a multi-agent framework that uses chat-style, role-based LLM agents to build software through three chained phases: design, coding, and testing. It adds a chat chain to divide tasks and a "communicative dehallucination" pattern where assistants ask clarifying questions to reduce coding errors. On a 1,200-task dataset (SRDD) ChatDev outperforms single- and other multi-agent baselines on completeness, executability, and overall quality, but it needs clear requirements and uses more tokens/time than single-agent methods.
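The chained, role-based structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ChatDev's actual code: the `Phase` dataclass, `run_phase`, `run_chain`, the role names, and the stubbed `fake_llm` are all hypothetical, and a real system would call an LLM where the stub is.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an LLM call: takes a role and a message, returns a reply.
LLM = Callable[[str, str], str]


@dataclass
class Phase:
    name: str          # e.g. "design", "coding", "testing"
    instructor: str    # role that gives instructions
    assistant: str     # role that proposes solutions
    max_turns: int = 2


def run_phase(phase: Phase, context: str, llm: LLM) -> str:
    """Run one instructor/assistant dialog and return the assistant's last reply."""
    solution = context
    for _ in range(phase.max_turns):
        instruction = llm(phase.instructor, solution)
        solution = llm(phase.assistant, instruction)
    return solution


def run_chain(phases: list[Phase], requirement: str, llm: LLM) -> str:
    """Hand each phase's solution to the next phase as its starting context."""
    context = requirement
    for phase in phases:
        context = run_phase(phase, context, llm)
    return context


def fake_llm(role: str, message: str) -> str:
    # Stub for illustration only: appends the speaking role to the message.
    return f"{message} -> {role}"


# Role names are assumptions chosen for the example.
CHAIN = [
    Phase("design", instructor="CEO", assistant="CTO", max_turns=1),
    Phase("coding", instructor="CTO", assistant="Programmer", max_turns=1),
    Phase("testing", instructor="Tester", assistant="Programmer", max_turns=1),
]

result = run_chain(CHAIN, "build a todo app", fake_llm)
print(result)
```

The printed string records the order in which roles spoke, which makes the phase-to-phase handoff visible without any model calls.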

Problem Statement

Software development involves multiple phases (design, coding, testing) and diverse roles. Prior ML work focuses on single phases with bespoke models, creating technical fragmentation. LLMs can play roles but tend to hallucinate in code (incomplete or unexecutable outputs). The paper asks: can a unified, language-based multi-agent system reliably produce more complete and executable software while reducing coding hallucinations?

Main Contribution

ChatDev: a chat-powered multi-agent framework that chains design, coding, and testing into sequential subtasks and uses role-based instructor/assistant pairs.

Communicative dehallucination: a dialog pattern where assistants proactively ask for specifics to avoid coding hallucinations.

SRDD: a 1,200-prompt Software Requirement Description Dataset spanning 5 categories for evaluation.

Empirical evaluation: comparisons with GPT-Engineer and MetaGPT plus ablations showing component effects.
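The communicative dehallucination pattern listed above can be approximated as a loop in which the assistant may ask for specifics before it commits to an answer. The sketch below is an assumption-laden illustration: `dehallucinated_reply`, the three callables, and the toy agents are hypothetical names, not part of ChatDev's API.

```python
from typing import Callable, Optional

# Hypothetical callables standing in for LLM-backed agents.
AskFn = Callable[[str, list[str]], Optional[str]]  # a question, or None if nothing is unclear
AnswerFn = Callable[[str], str]                    # instructor answers one question
SolveFn = Callable[[str, list[str]], str]          # assistant writes the final response


def dehallucinated_reply(
    instruction: str,
    ask: AskFn,
    answer: AnswerFn,
    solve: SolveFn,
    max_questions: int = 3,
) -> str:
    """Let the assistant request specifics before it commits to a response."""
    details: list[str] = []
    for _ in range(max_questions):
        question = ask(instruction, details)
        if question is None:  # the assistant has enough detail
            break
        details.append(answer(question))
    return solve(instruction, details)


# Toy agents for illustration only.
def ask(instruction: str, details: list[str]) -> Optional[str]:
    needed = ["Which GUI library?", "Which file format?"]
    return needed[len(details)] if len(details) < len(needed) else None


def answer(question: str) -> str:
    return {"Which GUI library?": "tkinter", "Which file format?": "JSON"}[question]


def solve(instruction: str, details: list[str]) -> str:
    return f"{instruction} using {', '.join(details)}"


reply = dehallucinated_reply("Write a note-taking app", ask, answer, solve)
print(reply)
```

Capping the number of questions with `max_questions` is a design choice made here to keep the dialog bounded; the source does not specify a limit.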

Key Findings

ChatDev generates more runnable software than baselines.

Numbers: executability is 0.88 for ChatDev vs 0.3583 for GPT-Engineer and 0.4145 for MetaGPT

Overall software quality (the product of completeness, executability, and consistency) is more than double that of either baseline.

Numbers: quality is 0.3953 for ChatDev vs 0.1523 for MetaGPT (approx. +0.24)

Human and automatic judges prefer ChatDev.

Numbers: pairwise human win rate is 90.16% vs GPT-Engineer and 88.00% vs MetaGPT

Removing dehallucination or roles hurts results.

Numbers: in the ablation, removing communicative dehallucination (CDH) lowers Quality to 0.3094 (from 0.3953); removing roles lowers it to 0.2212

Results

Completeness

Value: 0.5600

Baseline: GPT-Engineer 0.5022; MetaGPT 0.4834

Executability

Value: 0.8800

Baseline: GPT-Engineer 0.3583; MetaGPT 0.4145

Consistency

Value: 0.8021

Baseline: GPT-Engineer 0.7887; MetaGPT 0.7601

Quality

Value: 0.3953

Baseline: GPT-Engineer 0.1419; MetaGPT 0.1523
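Because Quality is defined as the product of the other three metrics, the reported values can be checked directly. The short script below recomputes Quality from the figures in this section; the `SCORES` table and the `quality` helper are names introduced only for this check.

```python
from math import prod

# Per-metric scores copied from the Results section.
SCORES = {
    "ChatDev": {"completeness": 0.5600, "executability": 0.8800, "consistency": 0.8021},
    "GPT-Engineer": {"completeness": 0.5022, "executability": 0.3583, "consistency": 0.7887},
    "MetaGPT": {"completeness": 0.4834, "executability": 0.4145, "consistency": 0.7601},
}


def quality(metrics: dict[str, float]) -> float:
    """Quality = completeness * executability * consistency."""
    return prod(metrics.values())


for name, metrics in SCORES.items():
    print(f"{name}: {quality(metrics):.4f}")
```

Rounded to four decimals, the products match the reported Quality values (0.3953, 0.1419, 0.1523). A multiplicative score means a weak result on any one metric pulls the overall score down sharply, which is why the executability gap dominates the comparison.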

Who Should Care

What To Try In 7 Days

Run the ChatDev repo on 5 small prompts from SRDD to compare outputs with your current pipeline

Implement instructor/assistant role prompts and measure executability on simple prototypes

Add a clarifying-question step (dehallucination) before code is committed, to reduce the number of non-executable builds that reach CI

Agent Features

Memory

  • Short-term memory per phase (dialog continuity)
  • Long-term memory as saved subtask solutions
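One way to picture the two memory types is a structure that keeps the full dialog only while a phase is running and retains just the agreed solution afterwards. This is a sketch under that assumption; `AgentMemory` and its methods are hypothetical and do not mirror ChatDev's internal classes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Long-term memory: one saved solution per finished phase.
    solutions: dict[str, str] = field(default_factory=dict)
    # Short-term memory: utterances of the phase currently in progress.
    dialog: list[str] = field(default_factory=list)

    def say(self, speaker: str, text: str) -> None:
        self.dialog.append(f"{speaker}: {text}")

    def close_phase(self, phase: str, solution: str) -> None:
        """Keep only the solution and drop the dialog that produced it."""
        self.solutions[phase] = solution
        self.dialog.clear()

    def context(self) -> str:
        """Prompt context: earlier solutions followed by the current dialog."""
        saved = [f"[{p}] {s}" for p, s in self.solutions.items()]
        return "\n".join(saved + self.dialog)


memory = AgentMemory()
memory.say("instructor", "Pick a language.")
memory.say("assistant", "Python.")
memory.close_phase("design", "Language: Python")
memory.say("instructor", "Write main.py.")
print(memory.context())
```

Passing only solutions across phases keeps later prompts short, at the cost of losing the reasoning that led to each decision.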

Planning

  • Multi-turn planning via chat chain

Tool Use

  • Python runtime for compile/run feedback
  • Inception prompting to seed dialog
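Run feedback from a Python runtime can be gathered with the standard library alone. The sketch below runs generated source in a subprocess and returns the error text that a testing agent could be shown; `run_feedback` is a hypothetical helper, not a ChatDev function.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_feedback(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run generated Python source in a subprocess; return (ok, error text)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "main.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stderr.strip()


ok, err = run_feedback("import missing_module_xyz\n")
print(ok)
print(err.splitlines()[-1] if err else "")
```

Averaging the `ok` flag over a set of prompts gives a simple executability rate. Generated code is untrusted, so in practice this should run inside a sandbox or container rather than on a developer machine.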

Frameworks

  • Chat chain
  • Communicative dehallucination
  • Inception prompting

Is Agentic

true

Architectures

  • LLM-powered agents (ChatGPT-3.5)

Collaboration

  • Paired instructor/assistant roles
  • Chain-structured phase-to-phase handoff

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Agents tend to implement simple logic; unclear requirements yield low-information outputs.
  • Current system is more suited to prototypes than complex, production-grade systems.
  • Holistic automated evaluation of arbitrary software remains infeasible; metrics cover completeness, executability, and consistency only.
  • Multi-agent runs consume more tokens and time, increasing compute cost and environmental impact.

When Not To Use

  • For complex, safety-critical production systems without human oversight
  • When requirements are vague or underspecified
  • When computational budget or latency constraints are tight

Failure Modes

  • Coding hallucinations: incomplete or unexecutable code
  • Missing imports or 'method not implemented' placeholders
  • Role flipping and instruction repetition in dialogs
  • Higher token usage and longer runtimes than single-agent pipelines
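The placeholder failure mode can be caught statically before any code is run. The sketch below uses Python's `ast` module to flag functions whose bodies contain nothing but `pass`, `...`, a docstring, or `raise NotImplementedError`; the helper names are assumptions made for this example, and the check is a heuristic rather than the paper's method.

```python
import ast


def is_placeholder(stmt: ast.stmt) -> bool:
    """True for `pass`, a bare `...`, a docstring, or `raise NotImplementedError`."""
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
        return stmt.value.value is Ellipsis or isinstance(stmt.value.value, str)
    if isinstance(stmt, ast.Raise) and stmt.exc is not None:
        exc = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
        return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"
    return False


def unimplemented_functions(source: str) -> list[str]:
    """Names of functions whose bodies contain only placeholder statements."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and all(is_placeholder(s) for s in node.body)
    ]


SAMPLE = '''
def load(path):
    """Load the saved game."""
    raise NotImplementedError("todo")

def save(path):
    pass

def add(a, b):
    return a + b
'''

print(unimplemented_functions(SAMPLE))
```

The check also flags intentional stubs such as abstract methods, so its output is best treated as a prompt for review or for another agent turn, not as a hard failure.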

Core Entities

Models

  • ChatGPT-3.5 (used as agent model)
  • GPT-4 (used as automatic evaluator)

Metrics

  • Completeness
  • Executability
  • Consistency
  • Quality (product of three metrics)

Datasets

  • SRDD (Software Requirement Description Dataset, 1,200 prompts)