Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

May 23, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and works with black-box LLMs and standard benchmarks, but it costs more compute and sometimes yields a confidently wrong consensus; audit outputs and distill debates to reduce cost.

Citations: 85

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch


Why It Matters For Business

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Summary TLDR

Run several copies of a language model as independent "agents" that propose answers, read and critique each other's outputs, and iterate for a few rounds. On math, chess, and factual tasks, this multiagent debate consistently improves final accuracy over single-model generation or simple reflection. Gains are shown with black-box access to ChatGPT-style models and with the same prompts across tasks. The method costs more compute and sometimes converges confidently on a wrong consensus, so use it when accuracy matters more than latency or cost.

Problem Statement

Large language models often state incorrect facts with confidence and make reasoning mistakes. Existing single-model prompting techniques reduce these errors but do not eliminate them. The paper asks whether independent model instances that iteratively critique each other can reach more accurate, less hallucinated answers without access to model internals.

Main Contribution

Introduce a multiagent debate procedure: multiple LLM instances propose answers, see others' replies, critique, and update over rounds.

Show consistent accuracy gains on reasoning tasks (arithmetic, GSM8K, chess) and factual tasks (biographies, MMLU, chess validity) using black-box model calls and identical prompts.
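The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `debate` function, the `LLM` type alias, and the prompt wording are all assumptions made for this example, and the first round is counted as one of the `n_rounds`. Any function that maps a prompt string to a reply string (for instance, a thin wrapper around a chat API) can be passed as `llm`.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


def debate(question: str, llm: LLM, n_agents: int = 3, n_rounds: int = 2) -> List[str]:
    """Return each agent's final answer after n_rounds of debate.

    Round 1 is the independent proposal; every later round shows each
    agent the other agents' latest answers and asks for an update.
    The prompt wording below is illustrative, not taken from the paper.
    """
    # Round 1: every agent answers independently.
    answers = [
        llm(f"Question: {question}\nGive your answer and reasoning.")
        for _ in range(n_agents)
    ]
    # Later rounds: each agent reads the others' answers and may revise.
    for _ in range(n_rounds - 1):
        updated = []
        for i in range(n_agents):
            others = "\n".join(
                f"Agent {j + 1}: {a}" for j, a in enumerate(answers) if j != i
            )
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents' answers:\n{others}\n"
                "Critique these answers, then give an updated answer."
            )
            updated.append(llm(prompt))
        answers = updated
    return answers
```

Because the model is passed in as a plain callable, the same loop works with black-box API access only, and a stub function can stand in for the model during testing.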

Key Findings

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on the authors' test set.

Numbers: Arithmetic 67.0% → 81.8% (Table 1)

Practical Use: Use a 3-agent, 2-round debate to boost basic numeric accuracy when answers must be correct.

Evidence Ref: Table 1

Debate improves grade-school math (GSM8K) accuracy from 77.0% to 85.0% in zero-shot settings.

Numbers: GSM8K 77.0% → 85.0% (Table 1)

Practical Use: Apply debate alongside chain-of-thought prompts to solve more math problems without model fine-tuning.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | Multi-Agent Debate 81.8% (± 2.3) | Single Agent 67.0% (± 4.7) | +14.8 pp | 100 synthetic arithmetic problems | Table 1 reports these mean ± std values | Table 1 |
| Accuracy | Multi-Agent Debate 85.0% (± 3.5) | Single Agent 77.0% (± 4.2) | +8.0 pp | 100 GSM8K problems | Table 1 reports zero-shot results | Table 1 |

What To Try In 7 Days

Run a 3-agent, 2-round debate using your existing LLM API and measure task accuracy vs single calls.

Combine debate with chain-of-thought prompts and compare math accuracy (the share of problems solved) on a small benchmark.

Summarize debate transcripts and fine-tune or distill them into training data to reduce inference cost later.
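For the first two experiments, the comparison reduces to computing accuracy for each setup and taking the difference in percentage points. The sketch below assumes exact string match after trimming whitespace, which is a simplification; real benchmarks usually need answer extraction and normalization first. The function names `accuracy` and `delta_pp` are hypothetical, and the toy predictions are made-up illustration data, not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def delta_pp(debate_acc: float, single_acc: float) -> float:
    """Improvement of debate over a single call, in percentage points."""
    return debate_acc - single_acc


# Toy data for illustration only (not from the paper).
gold = ["4", "9", "15", "7"]
single_call = ["4", "8", "15", "6"]
debated = ["4", "9", "15", "6"]

single_acc = accuracy(single_call, gold)
debate_acc = accuracy(debated, gold)
print(f"single: {single_acc:.1f}%  debate: {debate_acc:.1f}%  "
      f"delta: {delta_pp(debate_acc, single_acc):+.1f} pp")
```

Applied to the figures reported in Table 1, the same subtraction gives 81.8 − 67.0 = 14.8 pp for arithmetic and 85.0 − 77.0 = 8.0 pp for GSM8K.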

Agent Features

Memory: short-term debate context per round; optional summarization of past rounds to reduce context

Planning: iterative rounds of critique and update; consensus formation across agents

Tool Use: black-box API calls only; no access to model internals (likelihoods/gradients)

Frameworks: prompt templates for initial and debate rounds; summarization to compress many-agent replies

Is Agentic: Yes

Architectures: multi-instance LLMs (multiple copies of the same model); heterogeneous-agent debate (mixing models, e.g., ChatGPT + Bard)

Collaboration: agents read others' outputs and revise; final answer emerges as consensus
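One simple way to read off a consensus from the agents' final answers is a majority vote, together with the share of agents that agree. This is an assumption for illustration; the summary does not say how the final answer is extracted. The `consensus` helper is hypothetical, and it assumes answers have already been reduced to short comparable strings.

```python
from collections import Counter
from typing import List, Tuple


def consensus(final_answers: List[str]) -> Tuple[str, float]:
    """Return (most common answer, fraction of agents that gave it)."""
    if not final_answers:
        raise ValueError("need at least one answer")
    # Light normalization so that " 42" and "42" count as the same answer.
    normalized = [a.strip().lower() for a in final_answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


answer, agreement = consensus(["42", " 42", "41"])
print(answer, round(agreement, 2))
```

An agreement of 1.0 only means the agents are unanimous. It says nothing about correctness, so unanimous answers still need auditing.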

Optimization Features

Token Efficiency: summarize many-agent replies to stay within context limits

System Optimization: trade off number of agents and rounds vs compute cost

Inference Optimization: response summarization to reduce context length
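A back-of-the-envelope estimate makes the agents-versus-rounds trade-off concrete. The sketch below rests on stated assumptions rather than anything measured in the paper: each agent makes one model call per round, the first round counts as a round, and every later prompt contains the question, the agent's own previous reply, and all other agents' replies at full length (no summarization). The function names and token figures are hypothetical.

```python
def debate_calls(n_agents: int, n_rounds: int) -> int:
    """Model calls per question, assuming one call per agent per round."""
    if n_agents < 1 or n_rounds < 1:
        raise ValueError("n_agents and n_rounds must be at least 1")
    return n_agents * n_rounds


def debate_prompt_tokens(
    n_agents: int, n_rounds: int, question_tokens: int, reply_tokens: int
) -> int:
    """Rough total of prompt (input) tokens per question, without summarization."""
    if n_agents < 1 or n_rounds < 1:
        raise ValueError("n_agents and n_rounds must be at least 1")
    # Round 1: each agent sees only the question.
    first_round = n_agents * question_tokens
    # Later rounds: question + own previous reply + (n_agents - 1) other replies.
    per_prompt = question_tokens + n_agents * reply_tokens
    later_rounds = (n_rounds - 1) * n_agents * per_prompt
    return first_round + later_rounds


# Illustrative sizes: a 50-token question and 200-token replies.
print(debate_calls(3, 2))                   # 6 calls instead of 1
print(debate_prompt_tokens(3, 2, 50, 200))  # 2100 prompt tokens
```

Under these assumptions, prompt tokens in the later rounds grow with the square of the number of agents, which is why summarizing replies matters more as agents are added.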

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Higher inference cost: requires multiple agents and rounds.

Debates can converge confidently to an incorrect consensus.

When Not To Use

Low-latency or resource-constrained environments where cost matters more than accuracy.

Tasks where quick single-step answers suffice.

Failure Modes

Agents unanimously settle on a wrong but internally consistent answer.

Models focus only on recent rounds and drop critical earlier corrections.

Core Entities

Models

gpt-3.5-turbo-0301; ChatGPT; Bard

Metrics

Accuracy; Stockfish pawn score (∆PS); Validity (%); Consensus rate

Datasets

GSM8K; MMLU; BIG-Bench chess-state; Biographies (524 computer scientists, introduced here); Synthetic arithmetic tasks

Benchmarks

Arithmetic tasks; GSM8K; MMLU; Chess move prediction (pawn score); Chess move validity; Biographies factuality