Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
85
Why It Matters For Business
If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.
Summary TLDR
Run several copies of a language model as independent "agents" that propose answers, read and critique each other's outputs, and iterate for a few rounds. On math, chess, and factual tasks this multiagent debate consistently improves final accuracy over single-model generation or simple reflection. Gains are shown with black-box access to chatGPT-style models and with the same prompts across tasks. The method costs more compute and sometimes converges confidently to a wrong consensus, so use it when accuracy matters more than latency or cost.
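The procedure described above can be sketched as a short loop. This is a minimal illustration and not the authors' code: the `llm` callable, the prompt wording, and the convention that `n_rounds` counts only the critique rounds after the initial proposals are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: takes a prompt string, returns the model's reply.
LLM = Callable[[str], str]


def debate(question: str, llm: LLM, n_agents: int = 3, n_rounds: int = 2) -> List[str]:
    """Return each agent's latest answer after n_rounds of critique."""
    # Initial proposals: every agent answers independently.
    answers = [
        llm(f"Question: {question}\nGive your answer and reasoning.")
        for _ in range(n_agents)
    ]
    for _ in range(n_rounds):
        updated = []
        for i in range(n_agents):
            # Show agent i the other agents' latest replies.
            others = "\n\n".join(
                f"Agent {j + 1}: {a}" for j, a in enumerate(answers) if j != i
            )
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents answered:\n{others}\n"
                "Critique these answers and give an updated answer."
            )
            updated.append(llm(prompt))
        # All agents update at once, so each round sees the same snapshot.
        answers = updated
    return answers
```

Passing a stub such as `lambda prompt: "42"` as `llm` exercises the loop without any API calls; with 3 agents and 2 critique rounds it makes 9 calls in total.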
Problem Statement
Large language models often produce confident but incorrect facts and make reasoning mistakes. Existing single-model prompting fixes help but still fail. The paper asks whether independent model instances that iteratively critique each other can reach more accurate, less hallucinatory answers without internal model access.
Main Contribution
Introduce a multiagent debate procedure: multiple LLM instances propose answers, see others' replies, critique, and update over rounds.
Show consistent accuracy gains on reasoning tasks (arithmetic, GSM8K, chess) and factual tasks (biographies, MMLU, chess validity) using black-box model calls and identical prompts.
Provide an evaluation dataset of 524 computer scientist biographies to measure factual hallucinations and analyze debate dynamics, agent counts, rounds, and prompts.
Key Findings
Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on the paper's arithmetic test set.
Debate improves grade-school math (GSM8K) accuracy from 77.0% to 85.0% in zero-shot settings.
On factual tasks, debate raises biography agreement from 66.0% to 73.8% and MMLU accuracy from 63.9% to 71.1%.
Chess move quality improved materially: predicted-move pawn score rose from 91.4 to 122.9 and move validity from 29.3% to 45.2%.
More agents and more rounds generally improve performance, with diminishing returns beyond ~4 rounds.
Debate works with black-box APIs using identical prompts across tasks and benefits from chain-of-thought prompting.
Debates sometimes converge confidently on a wrong answer, and long debates can cause models to ignore earlier context.
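To compare the reported gains on a common footing, the arithmetic can be worked through directly. The baseline and debate figures below are copied from the findings above; the absolute and relative gains are derived here and are not numbers quoted from the paper.

```python
# (baseline, debate) pairs copied from the key findings.
results = {
    "Arithmetic accuracy (%)": (67.0, 81.8),
    "GSM8K accuracy (%)": (77.0, 85.0),
    "Biographies agreement (%)": (66.0, 73.8),
    "MMLU accuracy (%)": (63.9, 71.1),
    "Chess pawn score": (91.4, 122.9),
    "Chess move validity (%)": (29.3, 45.2),
}


def gains(baseline: float, debate: float) -> tuple:
    """Return (absolute gain, relative gain in percent), each rounded to 1 decimal."""
    absolute = round(debate - baseline, 1)
    relative = round(100 * (debate - baseline) / baseline, 1)
    return absolute, relative


for name, (baseline, debate) in results.items():
    absolute, relative = gains(baseline, debate)
    print(f"{name}: +{absolute} absolute, +{relative}% relative")
```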
Results
Charts (values not preserved): accuracy (three charts), chess predicted-move pawn score (Stockfish ∆PS), biographies factual agreement, chess move validity.
Who Should Care
What To Try In 7 Days
Run a 3-agent, 2-round debate using your existing LLM API and measure task accuracy vs single calls.
Combine debate with chain-of-thought prompts and compare math accuracy on a small benchmark.
Summarize debate transcripts and fine-tune or distill them into training data to reduce inference cost later.
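For the first two experiments, a small harness is enough to compare a single call against a debate pipeline on the same questions. This is a sketch under assumptions: `Solver` is a hypothetical wrapper that maps a question to a final answer string, and exact match after trimming and lowercasing stands in for whatever scoring rule your task needs.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: question in, final answer string out.
Solver = Callable[[str], str]


def accuracy(solver: Solver, dataset: Iterable[Tuple[str, str]]) -> float:
    """Fraction of questions whose normalized answer equals the reference."""
    items = list(dataset)
    if not items:
        raise ValueError("dataset is empty")
    correct = sum(
        solver(question).strip().lower() == reference.strip().lower()
        for question, reference in items
    )
    return correct / len(items)
```

Run `accuracy(single_call_solver, data)` and `accuracy(debate_solver, data)` on the same `data` and compare the two numbers; both solver names here are placeholders for your own wrappers.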
Agent Features
Memory
- short-term debate context per round
- optional summarization of past rounds to reduce context
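One way to combine the two memory options above is to keep the most recent round verbatim and compress everything older. The sketch below assumes a hypothetical `summarize` callable (for example, another LLM call) and an illustrative `keep_last` parameter; neither comes from the paper.

```python
from typing import Callable, List


def build_context(
    rounds: List[List[str]],
    summarize: Callable[[str], str],
    keep_last: int = 1,
) -> str:
    """Summarize all but the last keep_last rounds; keep recent rounds verbatim."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    old, recent = rounds[:-keep_last], rounds[-keep_last:]
    parts = []
    if old:
        # Flatten the older rounds into one transcript and compress it.
        transcript = "\n".join(reply for rnd in old for reply in rnd)
        parts.append("Summary of earlier rounds: " + summarize(transcript))
    for number, rnd in enumerate(recent, start=len(old) + 1):
        parts.append(f"Round {number}:\n" + "\n".join(rnd))
    return "\n\n".join(parts)
```

Keeping a summary of the older rounds, instead of dropping them, is meant to preserve earlier corrections that a model might otherwise lose in a long debate.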
Planning
- iterative rounds of critique and update
- consensus formation across agents
Tool Use
- black-box API calls only
- no access to model internals (likelihoods/gradients)
Frameworks
- prompt templates for initial and debate rounds
- summarization to compress many-agent replies
Is Agentic
true
Architectures
- multi-instance LLMs (multiple copies of the same model)
- heterogeneous-agent debate (mixing models e.g., chatGPT + Bard)
Collaboration
- agents read others' outputs and revise
- final answer emerges as consensus
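A simple way to read off the consensus is a majority vote over the agents' final answers, reported together with the share of agents that agree. This is an illustrative sketch; the function name and the trim-and-lowercase normalization are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def consensus(final_answers: List[str]) -> Tuple[str, float]:
    """Return the most common normalized answer and the fraction of agents giving it."""
    if not final_answers:
        raise ValueError("no answers to vote on")
    normalized = [answer.strip().lower() for answer in final_answers]
    # On a tie, Counter.most_common keeps the answer that appeared first.
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)
```

An agreement of 1.0 only means the agents agree with each other. It does not show the answer is correct, so unanimous outputs still need auditing where mistakes are costly.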
Optimization Features
Token Efficiency
- summarize many-agent replies to stay within context limits
System Optimization
- trade off number of agents and rounds vs compute cost
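The trade-off can be estimated before running anything. The sketch below is a rough model under stated assumptions: `n_rounds` counts critique rounds after the initial proposals, every reply has the same length, each critique prompt carries the question plus every agent's latest reply, and prompt-template overhead is ignored.

```python
def debate_cost(n_agents: int, n_rounds: int, question_tokens: int, reply_tokens: int) -> dict:
    """Rough call and token counts for debating one question."""
    calls = n_agents * (1 + n_rounds)
    # Initial proposals: each agent reads only the question.
    initial_input = n_agents * question_tokens
    # Critique prompts: the question plus all agents' latest replies (own and others').
    per_prompt = question_tokens + n_agents * reply_tokens
    critique_input = n_agents * n_rounds * per_prompt
    return {
        "calls": calls,
        "input_tokens": initial_input + critique_input,
        "output_tokens": calls * reply_tokens,
    }
```

Under these assumptions, input tokens grow roughly with the square of the agent count, because every agent reads every other agent's reply in each round. That growth is what response summarization is meant to contain.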
Inference Optimization
- response summarization to reduce context length
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher inference cost: requires multiple agents and rounds.
- Debates can converge confidently to an incorrect consensus.
- Long debates risk models ignoring early context unless you summarize.
- Results are reported mainly with gpt-3.5-turbo; generality to all models is not fully proven.
When Not To Use
- Low-latency or resource-constrained environments where cost matters more than accuracy.
- Tasks where quick single-step answers suffice.
- Cases where final answers cannot be audited and mistaken consensus is unacceptable.
Failure Modes
- Agents unanimously settle on a wrong but internally consistent answer.
- Models focus only on recent rounds and drop critical earlier corrections.
- Debate amplifies shared internal biases instead of correcting errors.
Core Entities
Models
- gpt-3.5-turbo-0301
- chatGPT
- Bard
Metrics
- Accuracy
- Stockfish pawn score (∆PS)
- Validity (%)
- Consensus rate
Datasets
- GSM8K
- MMLU
- BIG-Bench chess-state
- Biographies (524 computer scientists, introduced here)
- Synthetic arithmetic tasks
Benchmarks
- Arithmetic tasks
- GSM8K
- MMLU
- Chess move prediction (pawn score)
- Chess move validity
- Biographies factuality

