MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

August 1, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

130

Authors

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber

Links

Abstract / PDF

Why It Matters For Business

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code that needs fewer manual fixes. The trade-off is higher token cost in exchange for less engineering review time and higher delivery quality.

Summary TLDR

MetaGPT is an open-source framework that organizes LLMs as a simulated software company. It enforces role specialization (Product Manager, Architect, Project Manager, Engineer, QA), structured outputs (documents and diagrams), a shared message pool with subscriptions, and an executable-feedback loop (run tests and fix code, with up to 3 retries). These policies, called SOPs (Standard Operating Procedures: fixed team workflows and output formats), reduce hallucinations and improve code executability. On public code benchmarks the authors report Pass@1 rates of 85.9% (HumanEval) and 87.7% (MBPP), with measurable gains from the runtime feedback loop (+4.2% / +5.4% Pass@1). On a custom SoftwareDev suite MetaGPT yields higher executability than ChatDev (3.75 vs 2.25) and a lower human revision cost (0.83 vs 2.5).

Problem Statement

Current multi-agent, LLM-driven systems often devolve into noisy chat, inconsistent handovers, and cascading hallucinations when solving complex software tasks. This stems from free-form agent chat, lack of role standards, and no runtime self-correction. MetaGPT aims to fix this by imposing human-like SOPs, structured messages, role specialization, and executable feedback to make multi-agent programming more reliable.

Main Contribution

A meta-programming multi-agent framework (MetaGPT) that models agents as specialized company roles and enforces SOPs for consistent handovers.

Structured communication: agents publish structured documents/diagrams to a shared message pool and subscribe to role-relevant items.

Executable feedback: engineer agents run unit tests, record execution traces, and debug iteratively (up to 3 retries) to reduce runtime errors.

Empirical gains: reported state-of-the-art Pass@1 numbers on public code benchmarks and clear improvements in a SoftwareDev benchmark versus other agent frameworks.

Open-source release: implementation and demos provided at the project's GitHub.
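The structured-communication idea above can be illustrated with a minimal publish-subscribe sketch. This is not MetaGPT's actual API: the class names (`Message`, `MessagePool`), the role names, and the document kinds ("PRD", "SystemDesign") are hypothetical labels chosen for illustration. It only shows how a shared pool plus per-role subscriptions keeps each agent's inbox limited to role-relevant items.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: str   # role that published the message
    kind: str     # document type, e.g. "PRD" (hypothetical label)
    content: str  # the structured document body, kept as text here


@dataclass
class MessagePool:
    """Shared pool: every role publishes here and reads only what it subscribed to."""
    messages: list = field(default_factory=list)
    subscriptions: dict = field(default_factory=lambda: defaultdict(set))

    def subscribe(self, role: str, kind: str) -> None:
        self.subscriptions[role].add(kind)

    def publish(self, message: Message) -> None:
        self.messages.append(message)

    def inbox(self, role: str) -> list:
        # Filter the shared pool down to the kinds this role asked for.
        wanted = self.subscriptions[role]
        return [m for m in self.messages if m.kind in wanted]


pool = MessagePool()
pool.subscribe("Architect", "PRD")
pool.subscribe("Engineer", "SystemDesign")

pool.publish(Message("ProductManager", "PRD", "Build a drawing app"))
pool.publish(Message("Architect", "SystemDesign", "Canvas and Toolbar classes"))

# The Engineer sees the design document but not the PRD.
engineer_inbox = pool.inbox("Engineer")
```

A role with no subscriptions gets an empty inbox, which is the behavior the subscription mechanism relies on to keep irrelevant context out of an agent's prompt.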

Key Findings

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% and 87.7% on evaluated benchmarks

Executable feedback improves correctness.

Numbers: +4.2% (HumanEval), +5.4% (MBPP) Pass@1

Better end-to-end software executability and lower human fix cost vs ChatDev.

Numbers: SoftwareDev executability 3.75 vs ChatDev 2.25; human revision cost 0.83 vs 2.5

Trading tokens for higher-quality code.

Numbers: MetaGPT uses more tokens (31,255) but halves tokens-per-code-line (124.3) vs ChatDev (248.9)
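The token trade-off can be checked with a few lines of arithmetic. The three input figures are the ones quoted above; the implied line count is derived here by division and is not a number stated in this summary.

```python
# Figures quoted in the findings above.
metagpt_tokens = 31_255
metagpt_tokens_per_line = 124.3
chatdev_tokens_per_line = 248.9

# How many times more tokens ChatDev spends per line of code.
ratio = chatdev_tokens_per_line / metagpt_tokens_per_line

# Derived, not reported here: lines of code implied by total tokens / tokens per line.
implied_lines = metagpt_tokens / metagpt_tokens_per_line

print(f"ChatDev / MetaGPT tokens per line: {ratio:.2f}")      # 2.00
print(f"Implied MetaGPT lines of code:     {implied_lines:.0f}")  # 251
```

So the "halves" claim holds almost exactly: MetaGPT spends more tokens in total, but each token buys about twice as much code.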

Results

Pass@1 (public code benchmarks)

Value: 85.9% / 87.7%

Baseline: Prior chat-based multi-agent systems / GPT-4 alone

Executable feedback impact

Value: +4.2% / +5.4% Pass@1

Baseline: MetaGPT without feedback

SoftwareDev executability (1-4)

Value: MetaGPT: 3.75

Baseline: ChatDev: 2.25

Human revision cost (average fixes)

Value: 0.83 (MetaGPT)

Baseline: 2.5 (ChatDev)

Productivity (tokens per code line)

Value: 124.3 (MetaGPT)

Baseline: 248.9 (ChatDev)
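For readers unfamiliar with the Pass@1 metric used above, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval-style benchmarks. This summary does not say exactly how the authors computed their scores, so treat this as the standard definition rather than their evaluation script; the function names and the sample data are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass all tests."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list) -> float:
    """Average pass@1 over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Made-up example: four problems, one sample each, three of them pass.
results = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_pass_at_1(results)
print(f"Pass@1 = {score:.2%}")  # Pass@1 = 75.00%
```

With one sample per problem, Pass@1 reduces to the fraction of problems solved on the first attempt, which is why it is a natural measure of "runnable on first try" code.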

Who Should Care

What To Try In 7 Days

Clone the MetaGPT repo and run a simple SoftwareDev-style task (e.g., a drawing app) to see the end-to-end output.

Define 3 roles (PM, Architect, Engineer) and enforce structured PRD→design→code handovers.

Enable executable feedback so agents run unit tests and iterate (up to 3 retries), then measure the reduction in manual fixes.
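The executable-feedback step in the last item can be prototyped before wiring in a real model. The sketch below is an assumption-laden illustration, not MetaGPT's implementation: `run_tests`, `feedback_loop`, and the `revise` callback are hypothetical names, and `revise` stands in for whatever LLM call rewrites the code from the failure trace. Only the retry budget of 3 comes from the summary.

```python
import traceback
from typing import Callable, List, Tuple

MAX_RETRIES = 3  # retry budget reported in the summary


def run_tests(code: str, tests: str) -> Tuple[bool, str]:
    """Execute code and its tests in a fresh namespace; return (passed, trace).

    exec() is used only for illustration. Generated code should run in a
    sandbox or subprocess in any real setup.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def feedback_loop(
    code: str,
    tests: str,
    revise: Callable[[str, str, List[str]], str],
) -> Tuple[str, bool, int]:
    """Run tests, and on failure ask `revise` for a fix, up to MAX_RETRIES times."""
    history: List[str] = []  # execution/debug history handed back to the reviser
    passed, trace = run_tests(code, tests)
    retries = 0
    while not passed and retries < MAX_RETRIES:
        history.append(trace)
        code = revise(code, trace, history)
        retries += 1
        passed, trace = run_tests(code, tests)
    return code, passed, retries


def stub_revise(code: str, trace: str, history: List[str]) -> str:
    # Stand-in for an LLM call: applies one hard-coded fix.
    return code.replace("a - b", "a + b")


buggy = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\n"
final_code, passed, retries = feedback_loop(buggy, tests, stub_revise)
print(passed, retries)  # True 1
```

Logging `retries` and the final `passed` flag per task gives a direct way to measure how often the loop removes a fix that a human would otherwise have made.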

Agent Features

Memory

  • execution/debug history
  • handover summaries stored as long-term memory

Planning

  • ReAct-style loop (reason + act)
  • task decomposition into PRD, design, tasks, code, QA

Tool Use

  • web search tool
  • code execution/REPL for tests
  • diagram and debugging tools

Frameworks

  • SOP enforcement
  • executable feedback
  • subscription-based message filtering

Is Agentic

true

Architectures

  • role-specialized agent pipeline
  • assembly-line (SOP) workflow
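A role-specialized, assembly-line workflow can be sketched as a list of stages, each declaring what document it consumes, what it must produce, and which fields the output must contain. Everything here is hypothetical scaffolding for illustration (the `Stage` and `Document` classes, the field names, and the lambda stubs that stand in for LLM-backed roles); it is not how the MetaGPT codebase is organized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Document:
    kind: str               # e.g. "PRD", "SystemDesign" (hypothetical labels)
    fields: Dict[str, str]  # structured content of the handover


@dataclass(frozen=True)
class Stage:
    role: str
    consumes: str                       # document kind this role expects
    produces: str                       # document kind this role must emit
    required_fields: Tuple[str, ...]    # SOP: fields the output must contain
    act: Callable[[Document], Dict[str, str]]


def run_pipeline(stages: List[Stage], requirement: str) -> List[Document]:
    """Pass one document down the line, rejecting malformed handovers."""
    doc = Document("Requirement", {"text": requirement})
    trail = [doc]
    for stage in stages:
        if doc.kind != stage.consumes:
            raise ValueError(f"{stage.role} expected {stage.consumes}, got {doc.kind}")
        fields = stage.act(doc)
        missing = [name for name in stage.required_fields if name not in fields]
        if missing:
            raise ValueError(f"{stage.role} output is missing fields: {missing}")
        doc = Document(stage.produces, fields)
        trail.append(doc)
    return trail


# Stub roles: each lambda stands in for an LLM call.
stages = [
    Stage("ProductManager", "Requirement", "PRD", ("goals", "user_stories"),
          lambda d: {"goals": d.fields["text"], "user_stories": "As a user, I can draw."}),
    Stage("Architect", "PRD", "SystemDesign", ("modules", "interfaces"),
          lambda d: {"modules": "canvas, toolbar", "interfaces": "Canvas.draw(x, y)"}),
    Stage("Engineer", "SystemDesign", "Code", ("files",),
          lambda d: {"files": "canvas.py, toolbar.py"}),
]

trail = run_pipeline(stages, "Build a drawing app")
print([doc.kind for doc in trail])  # ['Requirement', 'PRD', 'SystemDesign', 'Code']
```

Rejecting a handover at the stage boundary, rather than letting a malformed document flow downstream, is the point of the fixed output formats: an incomplete design fails loudly at the Architect instead of surfacing later as a wrong interface in the code.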

Collaboration

  • shared message pool (publish-subscribe)
  • structured document/diagram handovers

Optimization Features

Token Efficiency

  • subscription filters to reduce irrelevant context

System Optimization

  • publish-subscribe pool to reduce redundant messaging

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No dedicated UI/frontend or multimodal agents yet, limiting real-world web UI tasks.
  • Higher token usage increases operational cost compared to lighter agent setups.
  • Projects are executed independently; limited cross-project learning in this version.

When Not To Use

  • When strict low-cost constraints forbid higher token/API consumption.
  • For heavy frontend or multimodal tasks until UI agents are added.
  • If you need continuous cross-project lifelong learning (current self-improvement is limited).

Failure Modes

  • Hallucinated or incomplete role outputs causing missing dependencies or wrong interfaces.
  • Cross-file dependency and import errors not caught without sufficient tests.
  • Information overload if subscription filters are poorly tuned, causing irrelevant context to leak in.

Core Entities

Models

  • GPT-4
  • gpt-3.5-turbo
  • DeepSeek Coder 33B
  • Codex
  • PaLM

Metrics

  • Pass@1
  • Executability score (1-4)
  • Token usage
  • Running time (s)
  • Human revision cost
  • Productivity (tokens per code line)

Datasets

  • HumanEval
  • MBPP
  • SoftwareDev (proprietary)

Benchmarks

  • HumanEval
  • MBPP
  • SoftwareDev