Overview
Multiagent debate is practical: it works with black-box LLMs and standard benchmarks. It costs more compute and can yield a confidently wrong consensus, so audit outputs and distill debates to reduce cost.
Citations: 85
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.
Summary TLDR
Run several copies of a language model as independent "agents" that propose answers, read and critique each other's outputs, and iterate for a few rounds. On math, chess, and factual tasks, this multiagent debate consistently improves final accuracy over single-model generation or simple reflection. The gains are shown with black-box access to ChatGPT-style models and with the same prompts across tasks. The method costs more compute and sometimes converges confidently on a wrong consensus, so use it when accuracy matters more than latency or cost.
Problem Statement
Large language models often produce confident but incorrect facts and make reasoning mistakes. Existing single-model prompting fixes help but still fail. The paper asks whether independent model instances that iteratively critique each other can reach more accurate, less hallucinatory answers without internal model access.
Main Contribution
Introduce a multiagent debate procedure: multiple LLM instances propose answers, see others' replies, critique, and update over rounds.
Show consistent accuracy gains on reasoning tasks (arithmetic, GSM8K, chess) and factual tasks (biographies, MMLU, chess validity) using black-box model calls and identical prompts.
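The procedure above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `debate`, `majority`, the `Model` callable type, and the prompt wording are all hypothetical, and `model` stands in for whatever black-box LLM call you already have. It also assumes that the first round is the independent proposal and that final answers are aggregated by majority vote.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a reply string.
Model = Callable[[str], str]


def debate(model: Model, question: str, n_agents: int = 3, n_rounds: int = 2) -> List[str]:
    """Return each agent's final answer after n_rounds of debate.

    Round 1 is an independent proposal. In each later round, every agent
    sees the other agents' latest answers plus its own and answers again.
    """
    answers = [model(question) for _ in range(n_agents)]
    for _ in range(n_rounds - 1):
        updated = []
        for i in range(n_agents):
            others = [a for j, a in enumerate(answers) if j != i]
            # Assumed prompt wording, for illustration only.
            prompt = (
                f"Question: {question}\n"
                "Other agents answered:\n"
                + "\n".join(f"- {a}" for a in others)
                + f"\nYour previous answer: {answers[i]}\n"
                "Critique these answers and give your updated answer."
            )
            updated.append(model(prompt))
        answers = updated
    return answers


def majority(answers: List[str]) -> str:
    """Pick the most common final answer (an assumed aggregation rule)."""
    return Counter(answers).most_common(1)[0][0]
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, which matches the black-box setting described above.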
Key Findings
Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on a test set of 100 synthetic arithmetic problems.
Debate improves grade-school math (GSM8K) accuracy from 77.0% to 85.0% in zero-shot settings.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Multi-Agent Debate 81.8% (±2.3) | Single Agent 67.0% (±4.7) | +14.8 pp | 100 synthetic arithmetic problems | Table 1 reports these mean ± std values | Table 1 |
| Accuracy | Multi-Agent Debate 85.0% (±3.5) | Single Agent 77.0% (±4.2) | +8.0 pp | 100 GSM8K problems | Table 1 reports zero-shot results | Table 1 |
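The Delta column is a difference in percentage points, not a relative change. The short sketch below recomputes it from the table's mean accuracies and also derives the share of baseline errors removed, which is a derived figure and not one reported in the table. The function names are hypothetical.

```python
def pp_delta(debate_acc: float, baseline_acc: float) -> float:
    """Absolute gain in percentage points."""
    return round(debate_acc - baseline_acc, 1)


def relative_error_reduction(debate_acc: float, baseline_acc: float) -> float:
    """Fraction of the baseline's errors that debate removes."""
    baseline_err = 100.0 - baseline_acc
    debate_err = 100.0 - debate_acc
    return round((baseline_err - debate_err) / baseline_err, 3)


# Mean accuracies taken from the table above.
print(pp_delta(81.8, 67.0))                  # 14.8 (arithmetic)
print(pp_delta(85.0, 77.0))                  # 8.0  (GSM8K)
print(relative_error_reduction(81.8, 67.0))  # 0.448
print(relative_error_reduction(85.0, 77.0))  # 0.348
```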
What To Try In 7 Days
Run a 3-agent, 2-round debate using your existing LLM API and measure task accuracy vs single calls.
Combine debate with chain-of-thought prompts and compare math accuracy against debate alone on a small benchmark.
Summarize debate transcripts and fine-tune or distill them into training data to reduce inference cost later.
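For the first experiment, a small harness along these lines may be enough to compare single calls against debate. It is a sketch under assumptions: `Solver`, `accuracy`, and `compare` are hypothetical names, each solver is assumed to return a final answer string, and scoring is plain exact match, which is likely too strict for free-form outputs.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical interface: question in, final answer string out.
Solver = Callable[[str], str]


def accuracy(solver: Solver, dataset: Iterable[Tuple[str, str]]) -> float:
    """Percentage of questions whose answer exactly matches the gold answer."""
    items = list(dataset)
    if not items:
        raise ValueError("dataset is empty")
    correct = sum(solver(q).strip() == gold.strip() for q, gold in items)
    return 100.0 * correct / len(items)


def compare(single: Solver, debated: Solver,
            dataset: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score both solvers on the same items and report the gap in points."""
    items = list(dataset)
    s = accuracy(single, items)
    d = accuracy(debated, items)
    return {"single": s, "debate": d, "delta_pp": d - s}
```

Scoring both solvers on the same list of items keeps the comparison paired, so differences come from the method and not from which questions were sampled.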
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Higher inference cost: requires multiple agents and rounds.
Debates can converge confidently to an incorrect consensus.
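To get a feel for the extra inference cost, the following back-of-the-envelope sketch counts model calls and prompt tokens per question. It assumes one call per agent per round, that the first round sends only the question, and that each later-round prompt carries the question plus every agent's latest answer. The function names and token figures are hypothetical, and real costs depend on the provider's pricing and the actual prompt format.

```python
def debate_calls(n_agents: int, n_rounds: int) -> int:
    """Model calls per question, assuming one call per agent per round."""
    if n_agents < 1 or n_rounds < 1:
        raise ValueError("n_agents and n_rounds must be at least 1")
    return n_agents * n_rounds


def debate_prompt_tokens(n_agents: int, n_rounds: int,
                         question_tokens: int, answer_tokens: int) -> int:
    """Rough prompt-token total per question under the stated assumptions."""
    if n_agents < 1 or n_rounds < 1:
        raise ValueError("n_agents and n_rounds must be at least 1")
    # Round 1: each agent receives only the question.
    first = n_agents * question_tokens
    # Later rounds: question plus all agents' latest answers, per agent.
    per_prompt = question_tokens + n_agents * answer_tokens
    later = n_agents * (n_rounds - 1) * per_prompt
    return first + later


# Example: 3 agents, 2 rounds, a 100-token question, 200-token answers.
print(debate_calls(3, 2))                    # 6 calls instead of 1
print(debate_prompt_tokens(3, 2, 100, 200))  # 2400 prompt tokens instead of 100
```

Under these assumptions, prompt tokens grow roughly with the square of the agent count in later rounds, which is one reason summarizing or distilling debates can pay off.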
When Not To Use
Low-latency or resource-constrained environments where cost matters more than accuracy.
Tasks where quick single-step answers suffice.
Failure Modes
Agents unanimously settle on a wrong but internally consistent answer.
Models focus only on recent rounds and drop critical earlier corrections.

