MetaGPT: use human-style SOPs, role agents and runtime execution checks to improve multi-agent code generation

Overview

Decision Snapshot: Ready For Pilot

MetaGPT combines concrete engineering practices (SOPs, structured outputs) with runtime execution checks. The reported evidence shows higher Pass@1 and better executability, but wider real-world deployment still needs UI/front-end agents and a cost trade-off analysis for its heavy token use.

Citations: 130

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber

Links

Abstract / PDF / Code

Why It Matters For Business

MetaGPT applies team-style SOPs and runtime test loops to LLM agents, producing more runnable code and fewer manual fixes. The trade-off is higher token cost in exchange for less engineering review time and higher delivery quality.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

MetaGPT is an open-source framework that organizes LLMs as a simulated software company. It enforces role specialization (Product Manager, Architect, Project Manager, Engineer, QA), structured outputs (documents and diagrams), a shared message pool with subscriptions, and an executable-feedback loop (run tests and fix code, with up to 3 retries). These policies are called SOPs (Standard Operating Procedures: fixed team workflows and output formats), and they reduce hallucinations and improve code executability. On public code benchmarks the authors report Pass@1 rates of 85.9% (HumanEval) and 87.7% (MBPP), with measurable gains from the runtime feedback loop (+4.2% / +5.4% Pass@1). On a custom SoftwareDev suite MetaGPT yields

Problem Statement

Current multi-agent, LLM-driven systems often devolve into noisy chat, inconsistent handovers, and cascading hallucinations when solving complex software tasks. This stems from free-form agent chat, lack of role standards, and no runtime self-correction. MetaGPT aims to fix this by imposing human-like SOPs, structured messages, role specialization, and executable feedback to make multi-agent programming more reliable.

Main Contribution

A meta-programming multi-agent framework (MetaGPT) that models agents as specialized company roles and enforces SOPs for consistent handovers.

Structured communication: agents publish structured documents/diagrams to a shared message pool and subscribe to role-relevant items.
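The publish-subscribe handover described above can be illustrated with a minimal sketch. This is not MetaGPT's actual API: the `Message` and `MessagePool` classes, the role names used as strings, and the document-type labels are all hypothetical, chosen only to show how subscriptions keep each role's context limited to the documents it needs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """A structured handover document published by one role (hypothetical type)."""
    sender: str    # role that produced the document
    doc_type: str  # illustrative label, e.g. "PRD" or "SystemDesign"
    content: str


@dataclass
class MessagePool:
    """Shared pool: roles publish once, subscribers read only what they asked for."""
    messages: list = field(default_factory=list)
    subscriptions: dict = field(default_factory=lambda: defaultdict(set))

    def subscribe(self, role: str, doc_type: str) -> None:
        self.subscriptions[role].add(doc_type)

    def publish(self, message: Message) -> None:
        self.messages.append(message)

    def inbox(self, role: str) -> list:
        # Filter the shared pool down to the document types this role subscribed to.
        wanted = self.subscriptions[role]
        return [m for m in self.messages if m.doc_type in wanted]


pool = MessagePool()
pool.subscribe("Architect", "PRD")
pool.subscribe("Engineer", "SystemDesign")
pool.publish(Message("ProductManager", "PRD", "Users can draw lines on a canvas."))
pool.publish(Message("Architect", "SystemDesign", "Canvas class with draw_line()."))

engineer_docs = [m.doc_type for m in pool.inbox("Engineer")]
print(engineer_docs)
```

In this sketch the Engineer never sees the PRD, only the design document it subscribed to, which is the filtering behavior the summary credits with reducing irrelevant context.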

Key Findings

High functional accuracy on public code benchmarks.

Numbers: Pass@1 = 85.9% (HumanEval) and 87.7% (MBPP)

Practical Use: Use MetaGPT's SOP-driven multi-agent pipeline to raise single-attempt correct code rates on standard code-generation tasks.

Evidence Ref: Abstract; Sec.4.2; Fig.4

Executable feedback improves correctness.

Numbers: +4.2% (HumanEval), +5.4% (MBPP) Pass@1

Practical Use: Add runtime test-and-fix loops to reduce post-generation bugs and increase pass rates.

Evidence Ref: Sec.4.4; Ablation text

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (public code benchmarks)	85.9% / 87.7%	Prior chat-based multi-agent systems / GPT-4 alone	Reported SoTA improvement vs earlier multi-agent frameworks	HumanEval / MBPP	Abstract; Sec.4.2; Fig.4	Fig.4; Sec.4.2
Executable feedback impact	+4.2% / +5.4% Pass@1	MetaGPT without feedback	+4.2% (HumanEval), +5.4% (MBPP)	HumanEval / MBPP	Sec.4.4 Ablation	Sec.4.4
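For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute Pass@k: the unbiased estimator introduced alongside the HumanEval benchmark. This summary does not state exactly how the authors computed their Pass@1 figures, so treat the function names and the per-problem `(n, c)` input format as assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Mean pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `k = 1` the estimator reduces to `c / n`, the fraction of sampled solutions that pass all tests, averaged over problems.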

What To Try In 7 Days

Clone the MetaGPT repo and run a simple SoftwareDev task (e.g., a drawing app) to see end-to-end output.

Define 3 roles (PM, Architect, Engineer) and enforce structured PRD→design→code handovers.

Enable executable feedback so agents run unit tests and iterate (up to 3 retries), then measure the reduction in manual fixes.
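The test-and-fix loop in the last step can be sketched as follows. This is a simplified illustration, not MetaGPT's implementation: `generate` stands in for an LLM call, `stub_engineer` is a hypothetical fake that returns buggy code first and corrected code once it receives an error, and the loop assumes one initial attempt followed by up to 3 retries.

```python
import traceback
from typing import Callable, Optional


def run_tests(code: str, tests: str) -> Optional[str]:
    """Execute code and its tests; return None on success, else the error text."""
    namespace: dict = {}
    try:
        # NOTE: exec on generated code is unsafe outside a sandbox; illustration only.
        exec(code, namespace)   # define the candidate implementation
        exec(tests, namespace)  # assertions raise on failure
        return None
    except Exception:
        return traceback.format_exc()


def generate_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    tests: str,
    max_retries: int = 3,
) -> tuple:
    """Return (code, passed, attempts), feeding each failure back to the generator."""
    feedback = None
    code = ""
    for attempt in range(1, max_retries + 2):  # one initial attempt plus retries
        code = generate(task, feedback)
        feedback = run_tests(code, tests)
        if feedback is None:
            return code, True, attempt
    return code, False, max_retries + 1


def stub_engineer(task: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for an LLM: buggy at first, fixed after seeing an error.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


code, passed, attempts = generate_with_feedback(
    stub_engineer, "Write add(a, b).", "assert add(2, 3) == 5"
)
print(passed, attempts)
```

Logging `attempts` per task gives a simple way to measure how often the retry loop, rather than a human reviewer, resolves a failure.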

Agent Features

Memory

execution/debug history; handover summaries stored as long-term memory

Planning

ReAct-style loop (reason + act); task decomposition into PRD, design, tasks, code, QA

Tool Use

web search tool; code execution/REPL for tests; diagram and debugging tools

Frameworks

SOP enforcement; executable feedback; subscription-based message filtering

Is Agentic

Yes

Architectures

role-specialized agent pipeline; assembly-line (SOP) workflow
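A minimal sketch of an assembly-line workflow of this kind is shown below. It is an assumption-laden illustration, not MetaGPT's code: the `Role` dataclass, the `run_sop` function, and the lambda "agents" are hypothetical, and real roles would call an LLM instead of formatting strings. The point is that each role declares which document it consumes and which it must produce, so a broken handover fails immediately.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Role:
    name: str
    consumes: str              # document type this role needs as input
    produces: str              # document type this role must hand over
    act: Callable[[str], str]  # hypothetical stand-in for an LLM call


def run_sop(roles: list, requirement: str) -> dict:
    """Run roles in fixed order; fail fast if a handover document is missing."""
    documents = {"Requirement": requirement}
    for role in roles:
        if role.consumes not in documents:
            raise KeyError(
                f"{role.name} needs {role.consumes!r}, which no earlier role produced"
            )
        documents[role.produces] = role.act(documents[role.consumes])
    return documents


pipeline = [
    Role("ProductManager", "Requirement", "PRD", lambda text: f"PRD for: {text}"),
    Role("Architect", "PRD", "SystemDesign", lambda text: f"Design from [{text}]"),
    Role("Engineer", "SystemDesign", "Code", lambda text: f"# code implementing [{text}]"),
]

documents = run_sop(pipeline, "a drawing app")
print(list(documents))
```

Dropping the Product Manager from `pipeline` makes `run_sop` raise a `KeyError` at the Architect step, which mirrors how a fixed workflow surfaces a missing handover instead of letting later roles improvise.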

Collaboration

shared message pool (publish-subscribe); structured document/diagram handovers

Optimization Features

Token Efficiency

subscription filters to reduce irrelevant context

System Optimization

publish-subscribe pool to reduce redundant messaging

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/geekan/MetaGPT

Risks & Boundaries

Limitations

No dedicated UI/frontend or multimodal agents yet, limiting real-world web UI tasks.

Higher token usage increases operational cost compared to lighter agent setups.

When Not To Use

When strict low-cost constraints forbid higher token/API consumption.

For heavy frontend or multimodal tasks until UI agents are added.

Failure Modes

Hallucinated or incomplete role outputs causing missing dependencies or wrong interfaces.

Cross-file dependency and import errors not caught without sufficient tests.

Core Entities

Models

GPT-4, gpt-3.5-turbo, Deepseek Coder 33B, CodeX, PaLM

Metrics

Pass@1; Executability score (1-4); Token usage; Running time (s); Human revision cost; Productivity (tokens per code line)
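The productivity metric is a simple ratio, sketched below. The summary does not say whether blank lines or comments count as code lines, so this hypothetical helper assumes only non-blank lines are counted; the function name is also an assumption.

```python
def tokens_per_code_line(total_tokens: int, code: str) -> float:
    """Tokens spent per non-blank line of generated code (lower is better)."""
    lines = [line for line in code.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no code lines to normalize by")
    return total_tokens / len(lines)
```

Tracking this ratio alongside raw token usage helps separate runs that are expensive because they produce more code from runs that are expensive because of chat overhead.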

Datasets

HumanEval, MBPP, SoftwareDev (proprietary)

Benchmarks

HumanEval, MBPP, SoftwareDev

