Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

May 23, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

85

Authors

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch

Links

Abstract / PDF

Why It Matters For Business

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Summary TLDR

Run several copies of a language model as independent "agents" that propose answers, read and critique each other's outputs, and iterate for a few rounds. On math, chess, and factual tasks, this multiagent debate consistently improves final accuracy over single-model generation or simple reflection. Gains are shown with black-box access to ChatGPT-style models and with the same prompts across tasks. The method costs more compute and sometimes converges confidently to a wrong consensus, so use it when accuracy matters more than latency or cost.

Problem Statement

Large language models often state incorrect facts confidently and make reasoning mistakes. Existing single-model prompting fixes help but do not eliminate these errors. The paper asks whether independent model instances that iteratively critique each other can reach more accurate answers with fewer hallucinations, without access to model internals.

Main Contribution

Introduce a multiagent debate procedure: multiple LLM instances propose answers, see others' replies, critique, and update over rounds.
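The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Generate` type, the prompt wording, and the convention that round 1 is the independent proposal step are all assumptions made for this example. Each agent is any function that maps a prompt string to a reply string, such as a thin wrapper around your own LLM client.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a reply string,
# e.g. a thin wrapper around your own LLM API client.
Generate = Callable[[str], str]


def debate_prompt(question: str, other_replies: List[str]) -> str:
    """Build the prompt an agent sees in a debate round (illustrative wording)."""
    others = "\n\n".join(
        f"Agent response {i + 1}: {reply}" for i, reply in enumerate(other_replies)
    )
    return (
        f"Question: {question}\n\n"
        f"These are the responses from other agents:\n{others}\n\n"
        "Using these responses as additional information, critique them "
        "and give your updated answer."
    )


def run_debate(question: str, agents: List[Generate], rounds: int = 2) -> List[str]:
    """Return each agent's final reply; round 1 is the independent proposal step."""
    # Round 1: every agent answers independently.
    replies = [agent(f"Question: {question}") for agent in agents]
    # Later rounds: each agent reads the others' latest replies and revises.
    for _ in range(rounds - 1):
        replies = [
            agent(debate_prompt(question, replies[:i] + replies[i + 1:]))
            for i, agent in enumerate(agents)
        ]
    return replies
```

Passing the same callable several times gives the "multiple copies of one model" setup, provided the underlying model samples with some randomness; passing different callables gives a mixed-model debate.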

Show consistent accuracy gains on reasoning tasks (arithmetic, GSM8K, chess) and factual tasks (biographies, MMLU, chess validity) using black-box model calls and identical prompts.

Provide an evaluation dataset of 524 computer scientist biographies to measure factual hallucinations, and analyze debate dynamics, agent counts, rounds, and prompts.

Key Findings

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on their test set.

Numbers: Arithmetic 67.0% → 81.8% (Table 1)

Debate improves grade-school math (GSM8K) accuracy from 77.0% to 85.0% in zero-shot settings.

Numbers: GSM8K 77.0% → 85.0% (Table 1)

On factual tasks, debate raised biography agreement from 66.0% to 73.8% and MMLU from 63.9% to 71.1%.

Numbers: Biographies 66.0% → 73.8%; MMLU 63.9% → 71.1% (Table 2)

Chess move quality improved materially: predicted-move pawn score rose from 91.4 to 122.9 and move validity from 29.3% to 45.2%.

Numbers: Chess PS 91.4 → 122.9; validity 29.3% → 45.2% (Tables 1, 2)

More agents and more rounds generally improve performance, with diminishing returns beyond ~4 rounds.

Numbers: Monotonic gains with agents/rounds; plateaus after ~4 rounds (Figures 10b, 10a)

Debate works with black-box APIs using identical prompts across tasks and benefits from chain-of-thought prompting.

Numbers: Gains observed using gpt-3.5-turbo and mixed-model debate (ChatGPT + Bard) (Figures 6, 11)

Debates sometimes converge to confidently wrong answers and long debates can force models to ignore older context.

Numbers: Authors report confident but incorrect consensus and reduced attention to earlier debate content (Section 5)

Results

Accuracy (Arithmetic)

Value: Multi-Agent Debate 81.8% (±2.3)

Baseline: Single Agent 67.0% (±4.7)

Accuracy (GSM8K)

Value: Multi-Agent Debate 85.0% (±3.5)

Baseline: Single Agent 77.0% (±4.2)

Chess predicted-move pawn score (Stockfish ∆PS)

Value: Multi-Agent Debate 122.9 (±7.6)

Baseline: Single Agent 91.4 (±10.6)

Biographies factual agreement

Value: Multi-Agent Debate 73.8% (±2.3)

Baseline: Single Agent 66.0% (±2.2)

Accuracy (MMLU)

Value: Multi-Agent Debate 71.1% (±4.6)

Baseline: Single Agent 63.9% (±4.8)

Chess move validity

Value: Multi-Agent Debate 45.2% (±2.9)

Baseline: Single Agent 29.3% (±2.6)
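To make the size of each improvement explicit, the short script below computes the absolute gain of debate over the single-agent baseline. The numbers are copied from the Key Findings and Results entries; the dictionary keys and the `absolute_gain` helper are names chosen for this illustration only.

```python
# (multi-agent debate, single-agent baseline) pairs copied from the reported results.
results = {
    "Arithmetic accuracy (%)": (81.8, 67.0),
    "GSM8K accuracy (%)": (85.0, 77.0),
    "Chess pawn score (Stockfish dPS)": (122.9, 91.4),
    "Biographies agreement (%)": (73.8, 66.0),
    "MMLU accuracy (%)": (71.1, 63.9),
    "Chess move validity (%)": (45.2, 29.3),
}


def absolute_gain(debate: float, baseline: float) -> float:
    """Difference between debate and baseline, rounded to one decimal place."""
    return round(debate - baseline, 1)


gains = {name: absolute_gain(d, b) for name, (d, b) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

These are differences in the metric's own units (percentage points for the accuracy-style metrics, pawn-score units for chess), and they should be read alongside the reported ± intervals.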

Who Should Care

What To Try In 7 Days

Run a 3-agent, 2-round debate using your existing LLM API and measure task accuracy vs single calls.

Combine debate with chain-of-thought prompts and compare math accuracy on a small benchmark.

Summarize debate transcripts and fine-tune or distill them into training data to reduce inference cost later.
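For the first experiment in this list, a small harness is enough to compare a single call against a debate on your own labeled examples. The sketch below assumes exact-match scoring on a final answer string and treats each system as a hypothetical `Predict` function from question to answer; how you extract the final answer from a model reply is left to your own code.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical type: a function that takes a question and returns a final answer string.
Predict = Callable[[str], str]
Example = Tuple[str, str]  # (question, gold answer)


def accuracy(predict: Predict, dataset: Sequence[Example]) -> float:
    """Fraction of examples whose predicted answer exactly matches the gold answer."""
    correct = sum(predict(q).strip() == gold.strip() for q, gold in dataset)
    return correct / len(dataset)


def compare(systems: Dict[str, Predict], dataset: Sequence[Example]) -> Dict[str, float]:
    """Score every named system on the same dataset."""
    return {name: accuracy(predict, dataset) for name, predict in systems.items()}
```

A typical call would look like `compare({"single": single_call, "debate": debate_call}, dataset)`, where both functions are wrappers you write around your LLM API. With only a few dozen examples the difference can easily be noise, so it is worth repeating the run or enlarging the benchmark before drawing conclusions.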

Agent Features

Memory

  • short-term debate context per round
  • optional summarization of past rounds to reduce context

Planning

  • iterative rounds of critique and update
  • consensus formation across agents

Tool Use

  • black-box API calls only
  • no access to model internals (likelihoods/gradients)

Frameworks

  • prompt templates for initial and debate rounds
  • summarization to compress many-agent replies

Is Agentic

true

Architectures

  • multi-instance LLMs (multiple copies of the same model)
  • heterogeneous-agent debate (mixing models, e.g., ChatGPT + Bard)

Collaboration

  • agents read others' outputs and revise
  • final answer emerges as consensus
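One simple way to turn the agents' final replies into a single answer is a majority vote that also reports how strongly the agents agree. This is an illustrative sketch rather than the paper's exact aggregation rule; the `consensus` name and the lowercase/whitespace normalization are assumptions made here.

```python
from collections import Counter
from typing import Sequence, Tuple


def consensus(final_answers: Sequence[str]) -> Tuple[str, float]:
    """Return (most common answer, fraction of agents that gave it).

    Answers are compared after stripping whitespace and lowercasing.
    Ties go to the answer that appears first.
    """
    normalized = [answer.strip().lower() for answer in final_answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)
```

The agreement fraction is useful for routing: low agreement can trigger another round or a human review. High agreement is not proof of correctness, since agents can agree on a wrong answer.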

Optimization Features

Token Efficiency

  • summarize many-agent replies to stay within context limits
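The summarization step can be sketched as a guard in front of the debate prompt: pass the other agents' replies through verbatim when they are short, and ask a model to compress them when they are not. The `compress_replies` name, the character budget (a crude stand-in for a token count), and the summary instruction are all assumptions for illustration.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a reply string.
Generate = Callable[[str], str]


def compress_replies(
    replies: List[str], summarize: Generate, max_chars: int = 2000
) -> str:
    """Return the replies verbatim if they fit the budget, else a model-written summary."""
    joined = "\n\n".join(f"Agent {i + 1}: {reply}" for i, reply in enumerate(replies))
    if len(joined) <= max_chars:
        return joined
    # Over budget: spend one extra model call to shrink what every agent must read.
    return summarize(
        "Summarize the following agent responses, keeping each distinct "
        f"final answer and its key reasoning:\n\n{joined}"
    )
```

The trade-off is one additional call per round in exchange for shorter prompts for every agent, which matters most as the number of agents grows.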

System Optimization

  • trade off number of agents and rounds vs compute cost
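A back-of-the-envelope estimate helps when choosing agent and round counts. The sketch below rests on simplifying assumptions that are not from the paper: every reply has the same length, round 1 prompts contain only the question, and each later prompt contains the question plus the other agents' latest replies (no full history, no summarization). Treat the result as a rough lower bound.

```python
from typing import Dict


def debate_cost(
    agents: int, rounds: int, question_tokens: int, reply_tokens: int
) -> Dict[str, int]:
    """Estimate API calls and token volume for one debated question."""
    calls = agents * rounds
    # Round 1: each agent reads only the question.
    first_round_input = agents * question_tokens
    # Later rounds: each agent reads the question plus (agents - 1) peer replies.
    later_input = (rounds - 1) * agents * (
        question_tokens + (agents - 1) * reply_tokens
    )
    return {
        "calls": calls,
        "input_tokens": first_round_input + later_input,
        "output_tokens": calls * reply_tokens,
    }
```

Under these assumptions, calls grow linearly with agents and rounds, while input tokens in later rounds grow roughly with the square of the agent count, which is why summarizing replies becomes attractive for larger debates.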

Inference Optimization

  • response summarization to reduce context length

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher inference cost: requires multiple agents and rounds.
  • Debates can converge confidently to an incorrect consensus.
  • Long debates risk models ignoring early context unless you summarize.
  • Results are reported mainly with gpt-3.5-turbo; generality to all models is not fully proven.

When Not To Use

  • Latency-sensitive or resource-constrained environments where speed or cost matters more than accuracy.
  • Tasks where quick single-step answers suffice.
  • Cases where final answers cannot be audited and mistaken consensus is unacceptable.

Failure Modes

  • Agents unanimously settle on a wrong but internally consistent answer.
  • Models focus only on recent rounds and drop critical earlier corrections.
  • Debate amplifies shared internal biases instead of correcting errors.

Core Entities

Models

  • gpt-3.5-turbo-0301
  • ChatGPT
  • Bard

Metrics

  • Accuracy
  • Stockfish pawn score (∆PS)
  • Validity (%)
  • Consensus rate

Datasets

  • GSM8K
  • MMLU
  • BIG-Bench chess-state
  • Biographies (524 computer scientists, introduced here)
  • Synthetic arithmetic tasks

Benchmarks

  • Arithmetic tasks
  • GSM8K
  • MMLU
  • Chess move prediction (pawn score)
  • Chess move validity
  • Biographies factuality