Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.3
Citation Count
69
Why It Matters For Business
ChatDev speeds up software prototyping and makes it more reliable: role-based LLM agents work through a chained workflow, which raises the chance that generated code runs without heavy manual fixes.
Summary TLDR
ChatDev is a multi-agent framework that uses chat-style, role-based LLM agents to build software through three chained phases: design, coding, and testing. It adds a chat chain that divides each phase into subtasks and a "communicative dehallucination" pattern in which assistants ask clarifying questions to reduce coding errors. On SRDD, a 1,200-task dataset, ChatDev outperforms single-agent and other multi-agent baselines on completeness, executability, and overall quality, but it needs clear requirements and uses more tokens and time than single-agent methods.
Problem Statement
Software development involves multiple phases (design, coding, testing) and diverse roles. Prior ML work focuses on single phases with bespoke models, creating technical fragmentation. LLMs can play roles but tend to hallucinate in code (incomplete or unexecutable outputs). The paper asks: can a unified, language-based multi-agent system reliably produce more complete and executable software while reducing coding hallucinations?
Main Contribution
ChatDev: a chat-powered multi-agent framework that chains design, coding, and testing into sequential subtasks and uses role-based instructor/assistant pairs.
Communicative dehallucination: a dialog pattern where assistants proactively ask for specifics to avoid coding hallucinations.
SRDD: a 1,200-prompt Software Requirement Description Dataset spanning 5 categories for evaluation.
Empirical evaluation: comparisons with GPT-Engineer and MetaGPT plus ablations showing component effects.
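The chat chain described above can be sketched roughly as follows. This is a minimal illustration, not ChatDev's actual API: `fake_llm` is a hypothetical stub standing in for a real model call, and the role names, subtask goals, and `run_chat_chain` function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def fake_llm(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; it only echoes its inputs."""
    return f"[{role}] response to: {prompt}"


@dataclass
class Subtask:
    phase: str       # "design", "coding", or "testing"
    instructor: str  # role that issues instructions (assumed name)
    assistant: str   # role that proposes the solution (assumed name)
    goal: str


def run_chat_chain(
    requirement: str,
    chain: List[Subtask],
    llm: Callable[[str, str], str] = fake_llm,
    turns: int = 2,
) -> Dict[str, str]:
    """Run subtasks in order; each subtask's solution feeds the next one."""
    solutions: Dict[str, str] = {}
    context = requirement
    for task in chain:
        solution = ""
        for _ in range(turns):
            # Instructor speaks first, then the assistant answers.
            instruction = llm(task.instructor, f"{task.goal} | context: {context}")
            solution = llm(task.assistant, instruction)
        solutions[task.goal] = solution
        context = solution  # only the solution is handed to the next subtask
    return solutions


# Illustrative chain: one subtask per phase, with assumed role names.
CHAIN = [
    Subtask("design", "CEO", "CTO", "choose design"),
    Subtask("coding", "CTO", "Programmer", "write code"),
    Subtask("testing", "Tester", "Programmer", "fix failures"),
]

if __name__ == "__main__":
    for goal, solution in run_chat_chain("a to-do list app", CHAIN).items():
        print(goal, "->", solution[:60])
```

Handing only the final solution to the next subtask, rather than the whole dialog, is one plausible way to keep each phase's context small. Whether ChatDev does exactly this is not stated in this summary.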
Key Findings
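ChatDev generates more runnable software than the baselines.
Overall software quality, defined as the product of completeness, executability, and consistency, is higher than the baselines'.
Both human evaluators and the automatic (GPT-4) judge prefer ChatDev's outputs.
Ablations that remove communicative dehallucination or role assignments lower the scores.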
Results
Completeness
Executability
Consistency
Quality
Who Should Care
What To Try In 7 Days
Run the ChatDev repo on 5 small prompts from SRDD to compare outputs with your current pipeline
Implement instructor/assistant role prompts and measure executability on simple prototypes
Add a clarifying-question step (dehallucination) before code is committed, so that less unexecutable code reaches CI
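The clarifying-question step in the last item could be prototyped along these lines. This is a sketch under stated assumptions: the keyword checklist in `REQUIRED_DETAILS` is a hypothetical stand-in for letting the assistant model decide what is underspecified, and `ask_instructor` and `write_code` are placeholder callables, not ChatDev functions.

```python
from typing import Callable, List

# Assumed checklist, for illustration only; a real system would let the
# assistant model decide which details are missing.
REQUIRED_DETAILS = ("language", "input", "output")


def clarifying_questions(instruction: str) -> List[str]:
    """Return one question per required detail the instruction does not mention."""
    text = instruction.lower()
    return [f"Please specify the {d}." for d in REQUIRED_DETAILS if d not in text]


def dehallucinated_reply(
    instruction: str,
    ask_instructor: Callable[[str], str],
    write_code: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Ask for missing specifics before writing code, at most max_rounds times."""
    for _ in range(max_rounds):
        questions = clarifying_questions(instruction)
        if not questions:
            break
        # The assistant asks and the instructor answers, before any code is written.
        answer = ask_instructor(questions[0])
        instruction = f"{instruction} {answer}"
    return write_code(instruction)
```

Capping the loop with `max_rounds` keeps a vague requirement from stalling the pipeline. The code is still written after the cap, so a CI check remains useful as a backstop.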
Agent Features
Memory
- Short-term memory per phase (dialog continuity)
- Long-term memory as saved subtask solutions
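One way to picture the two memory types listed above is the small class below. It is a sketch only: the `AgentMemory` name and its methods are hypothetical, and the choice to discard the dialog when a phase closes is an assumption made for the example.

```python
from typing import Dict, List, Tuple


class AgentMemory:
    """Illustrative split between per-phase dialog and saved solutions."""

    def __init__(self) -> None:
        # (role, utterance) pairs for the phase currently in progress.
        self.short_term: List[Tuple[str, str]] = []
        # Phase or subtask name mapped to its final solution.
        self.long_term: Dict[str, str] = {}

    def record(self, role: str, utterance: str) -> None:
        """Append one dialog turn to the current phase's short-term memory."""
        self.short_term.append((role, utterance))

    def close_phase(self, name: str, solution: str) -> None:
        """Save the agreed solution and drop the dialog that produced it."""
        self.long_term[name] = solution
        self.short_term.clear()

    def context_for_next_phase(self) -> str:
        """Render saved solutions as the context handed to the next phase."""
        return "\n".join(f"{k}: {v}" for k, v in self.long_term.items())
```

For example, after recording a short design dialog and calling `close_phase("design", "Python")`, the next phase would see only the line `design: Python`.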
Planning
- Multi-turn planning via chat chain
Tool Use
- Python runtime for compile/run feedback
- Inception prompting to seed dialog
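The run-time feedback idea can be sketched with the standard library alone. The function name `run_and_collect_feedback` and the choice to return only the last line of the traceback are assumptions for illustration, and the summary does not say how ChatDev itself invokes the interpreter.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_and_collect_feedback(source: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run source in a fresh interpreter; return (ok, feedback text for the agent)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "main.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout_s} seconds."
    if proc.returncode == 0:
        return True, "Ran without errors."
    # The last traceback line usually names the exception, e.g. a missing module.
    lines = proc.stderr.strip().splitlines()
    return False, lines[-1] if lines else "Non-zero exit with empty stderr."
```

Generated code is untrusted, so in practice a call like this belongs inside a container or other sandbox rather than on a developer machine.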
Frameworks
- Chat chain
- Communicative dehallucination
- Inception prompting
Is Agentic
true
Architectures
- LLM-powered agents (ChatGPT-3.5)
Collaboration
- Paired instructor/assistant roles
- Chain-structured phase-to-phase handoff
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Agents tend to implement simple logic; unclear requirements yield low-information outputs.
- Current system is more suited to prototypes than complex, production-grade systems.
- Holistic automated evaluation of arbitrary software remains infeasible; metrics cover completeness, executability, and consistency only.
- Multi-agent runs consume more tokens and time, increasing compute cost and environmental impact.
When Not To Use
- For complex, safety-critical production systems without human oversight
- When requirements are vague or underspecified
- When computational budget or latency constraints are tight
Failure Modes
- Coding hallucinations: incomplete or unexecutable code
- Missing imports or 'method not implemented' placeholders
- Role flipping and instruction repetition in dialogs
- Higher token usage and longer runtimes than single-agent pipelines
Core Entities
Models
- ChatGPT-3.5 (used as agent model)
- GPT-4 (used as automatic evaluator)
Metrics
- Completeness
- Executability
- Consistency
- Quality (product of three metrics)
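As a small worked example of the quality metric, the function below multiplies the three component scores. It assumes each score lies in [0, 1], which this summary does not state explicitly, and the numbers in the usage note are illustrative, not results from the paper.

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Overall quality as the product of the three component metrics."""
    scores = {
        "completeness": completeness,
        "executability": executability,
        "consistency": consistency,
    }
    for name, value in scores.items():
        # Assumed range check; the [0, 1] bound is an assumption of this sketch.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return completeness * executability * consistency
```

Because the scores are multiplied, one weak component pulls the total down sharply: `quality(0.5, 0.8, 0.5)` gives about 0.2, and any component at 0 gives 0 regardless of the other two.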
Datasets
- SRDD (Software Requirement Description Dataset, 1,200 prompts)

