MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

August 1, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

MetaGPT combines concrete engineering practices (SOPs, structured outputs) with runtime execution checks. The evidence shows higher Pass@1 and better executability, but wider real-world deployment still needs UI/front-end agents and a cost trade-off analysis for its heavy token use.

Citations: 130

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber

Why It Matters For Business

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code and fewer manual fixes. The trade-off is higher token cost in exchange for less engineering review time and higher delivery quality.

Who Should Care

Summary TLDR

MetaGPT is an open-source framework that organizes LLMs as a simulated software company. It enforces role specialization (Product Manager, Architect, Project Manager, Engineer, QA), structured outputs (documents and diagrams), a shared message pool with subscriptions, and an executable-feedback loop (run tests and fix code, with up to 3 retries). These policies, called SOPs (Standard Operating Procedures: fixed team workflows and output formats), reduce hallucinations and improve code executability. On public code benchmarks the authors report Pass@1 rates of 85.9% (HumanEval) and 87.7% (MBPP), with measurable gains from the runtime feedback loop (+4.2% / +5.4% Pass@1). On a custom SoftwareDev suite MetaGPT yields

Problem Statement

Current multi-agent, LLM-driven systems often devolve into noisy chat, inconsistent handovers, and cascading hallucinations when solving complex software tasks. This stems from free-form agent chat, lack of role standards, and no runtime self-correction. MetaGPT aims to fix this by imposing human-like SOPs, structured messages, role specialization, and executable feedback to make multi-agent programming more reliable.

Main Contribution

A meta-programming multi-agent framework (MetaGPT) that models agents as specialized company roles and enforces SOPs for consistent handovers.

Structured communication: agents publish structured documents/diagrams to a shared message pool and subscribe to role-relevant items.
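The publish-subscribe idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MetaGPT's actual API: the names `Message`, `MessagePool`, `subscribe`, `publish` and `inbox`, as well as the document-type strings, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str    # role that produced the document
    doc_type: str  # e.g. "PRD", "SystemDesign" (hypothetical labels)
    content: str


class MessagePool:
    """Shared pool: every role publishes here, each role reads only its subscriptions."""

    def __init__(self):
        self._messages = []
        self._subscriptions = defaultdict(set)  # role -> document types of interest

    def subscribe(self, role, *doc_types):
        self._subscriptions[role].update(doc_types)

    def publish(self, message):
        self._messages.append(message)

    def inbox(self, role):
        """Return only the messages whose type the role subscribed to."""
        wanted = self._subscriptions[role]
        return [m for m in self._messages if m.doc_type in wanted]


pool = MessagePool()
pool.subscribe("Architect", "PRD")
pool.subscribe("Engineer", "SystemDesign", "TaskList")

pool.publish(Message("ProductManager", "PRD", "Drawing app: canvas, brush, save"))
pool.publish(Message("Architect", "SystemDesign", "Modules: canvas.py, tools.py"))

architect_view = [m.doc_type for m in pool.inbox("Architect")]
engineer_view = [m.doc_type for m in pool.inbox("Engineer")]
print(architect_view)
print(engineer_view)
```

Filtering at read time keeps each role's context limited to the documents it needs, which is the property the summary attributes to subscriptions.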

Key Findings

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% (HumanEval) and 87.7% (MBPP)

Practical Use: Use MetaGPT's SOP-driven multi-agent pipeline to raise single-attempt correct code rates on standard code-generation tasks.

Evidence Ref: Abstract; Sec. 4.2; Fig. 4

Executable feedback improves correctness.

Numbers: +4.2% (HumanEval), +5.4% (MBPP) Pass@1

Practical Use: Add runtime test-and-fix loops to reduce post-generation bugs and increase pass rates.

Evidence Ref: Sec. 4.4; Ablation text
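A test-and-fix loop of this kind might look like the sketch below. It is an illustration only: `run_tests`, `executable_feedback` and `toy_revise` are hypothetical names, and `toy_revise` is a hard-coded stand-in for the LLM call that would rewrite the code in a real system. The retry cap of 3 follows the figure quoted in the summary.

```python
import traceback

MAX_RETRIES = 3  # the summary reports up to 3 retries


def run_tests(code: str, tests: str):
    """Execute candidate code and its tests; return (passed, error_text)."""
    namespace = {}
    try:
        exec(code, namespace)   # define the candidate functions
        exec(tests, namespace)  # failing asserts raise AssertionError
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def executable_feedback(code, tests, revise):
    """Run the tests; on failure hand the error text to `revise` and retry."""
    passed, error = run_tests(code, tests)
    retries = 0
    while not passed and retries < MAX_RETRIES:
        code = revise(code, error)  # in a real system: an LLM call (assumption)
        retries += 1
        passed, error = run_tests(code, tests)
    return code, passed, retries


def toy_revise(code, error):
    """Stand-in for the LLM: patches one known bug when a test assertion fails."""
    return code.replace("a - b", "a + b") if "AssertionError" in error else code


buggy = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\n"
final_code, ok, used = executable_feedback(buggy, tests, toy_revise)
print(ok, used)
```

Running untrusted generated code with `exec` is only acceptable in a toy; a pilot would more plausibly run tests in a sandboxed subprocess or container.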

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (public code benchmarks) | 85.9% / 87.7% | Prior chat-based multi-agent systems / GPT-4 alone | Reported SoTA improvement vs earlier multi-agent frameworks | HumanEval / MBPP | Abstract; Sec. 4.2; Fig. 4 | Fig. 4; Sec. 4.2
Executable feedback impact | +4.2% / +5.4% Pass@1 | MetaGPT without feedback | +4.2% (HumanEval), +5.4% (MBPP) | HumanEval / MBPP | Sec. 4.4 Ablation | Sec. 4.4
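For readers unfamiliar with the metric, Pass@1 is the share of problems solved by a single generated sample. The sketch below uses the unbiased pass@k estimator commonly applied to HumanEval-style benchmarks; whether the authors computed their figures exactly this way is an assumption, and the toy outcomes are invented for illustration, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """results: list of (n_samples, n_correct) per problem; returns mean pass@1."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical outcomes for four problems, one sample each (not paper data).
toy_results = [(1, 1), (1, 1), (1, 0), (1, 1)]
score = benchmark_pass_at_1(toy_results)
print(f"Pass@1 = {score:.1%}")
```

With k = 1 the estimator reduces to c / n per problem, so the benchmark score is simply the average fraction of correct first attempts.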

What To Try In 7 Days

Clone the MetaGPT repo and run a simple SoftwareDev task (e.g., a drawing app) to see the end-to-end output.

Define 3 roles (PM, Architect, Engineer) and enforce structured PRD→design→code handovers.

Enable executable feedback so agents run unit tests and iterate (up to 3 retries), then measure the reduction in manual fixes.
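The PRD→design→code handover in the second step can be prototyped with typed documents before wiring in any model. The following is a rough sketch: the classes `PRD`, `Design` and `CodeBundle` and the functions `pm`, `architect` and `engineer` are hypothetical, and each function body is a placeholder for what would be an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class PRD:
    goal: str
    user_stories: list


@dataclass
class Design:
    prd: PRD
    modules: list  # file names the Engineer must implement


@dataclass
class CodeBundle:
    design: Design
    files: dict = field(default_factory=dict)  # file name -> source text


def pm(requirement: str) -> PRD:
    """Product Manager role: turn a raw requirement into a PRD (placeholder logic)."""
    return PRD(goal=requirement, user_stories=[f"As a user, I can {requirement}"])


def architect(prd: PRD) -> Design:
    """Architect role: refuse an incomplete PRD instead of guessing."""
    if not prd.user_stories:
        raise ValueError("PRD handover rejected: no user stories")
    return Design(prd=prd, modules=["main.py"])


def engineer(design: Design) -> CodeBundle:
    """Engineer role: emit one file per module named in the design."""
    files = {name: f"# implements: {design.prd.goal}\n" for name in design.modules}
    return CodeBundle(design=design, files=files)


bundle = engineer(architect(pm("draw on a canvas")))
print(sorted(bundle.files))
```

Because each role accepts only the previous role's document type, an incomplete handover fails loudly at the boundary rather than propagating downstream.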

Agent Features

Memory
execution/debug history; handover summaries stored as long-term memory
Planning
ReAct-style loop (reason + act); task decomposition into PRD, design, tasks, code, QA
Tool Use
web search tool; code execution/REPL for tests; diagram and debugging tools
Frameworks
SOP enforcement; executable feedback; subscription-based message filtering
Is Agentic

Yes

Architectures
role-specialized agent pipeline; assembly-line (SOP) workflow
Collaboration
shared message pool (publish-subscribe); structured document/diagram handovers

Optimization Features

Token Efficiency
subscription filters to reduce irrelevant context
System Optimization
publish-subscribe pool to reduce redundant messaging

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No dedicated UI/frontend or multimodal agents yet, limiting real-world web UI tasks.

Higher token usage increases operational cost compared to lighter agent setups.

When Not To Use

When strict low-cost constraints forbid higher token/API consumption.

For heavy frontend or multimodal tasks until UI agents are added.

Failure Modes

Hallucinated or incomplete role outputs causing missing dependencies or wrong interfaces.

Cross-file dependency and import errors that go uncaught without sufficient tests.

Core Entities

Models

GPT-4, gpt-3.5-turbo, DeepSeek Coder 33B, Codex, PaLM

Metrics

Pass@1; Executability score (1-4); Token usage; Running time (s); Human revision cost; Productivity (tokens per code line)
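The productivity metric is a simple ratio, which the sketch below makes concrete. The function names and the line-counting convention (skip blank and comment-only lines) are assumptions for illustration, and the token count is an invented figure.

```python
def productivity(total_tokens: int, code_lines: int) -> float:
    """Tokens consumed per line of generated code."""
    if code_lines <= 0:
        raise ValueError("code_lines must be positive")
    return total_tokens / code_lines


def count_code_lines(source: str) -> int:
    """Count non-blank lines that are not pure comments (one possible convention)."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


sample = "import sys\n\n# entry point\nprint(sys.argv)\n"
lines = count_code_lines(sample)
print(productivity(total_tokens=500, code_lines=lines))
```

A lower value means fewer tokens were spent per line of code, so this metric should be read alongside raw token usage when estimating cost.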

Datasets

HumanEval, MBPP, SoftwareDev (proprietary)

Benchmarks

HumanEval, MBPP, SoftwareDev