Make LLMs argue: multi-model round-table + confidence-weighted voting improves reasoning

September 22, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

4

Authors

Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal

Why It Matters For Business

Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.

Summary TLDR

RECONCILE runs a short multi-round "round-table" discussion among different LLMs (e.g., ChatGPT, Bard, Claude2). Each agent provides an answer, a step-by-step explanation, and a confidence score. Agents see grouped answers, other agents' explanations, and a few human "convincing" examples. They then revise their answers over successive rounds, and a calibrated confidence-weighted vote picks the team answer. On seven reasoning benchmarks, RECONCILE raises accuracy over single-agent and prior multi-agent baselines (up to +11.4pp on Date Understanding) and even beats GPT-4 on some commonsense tasks. Gains rely on model diversity, confidence calibration, and a few corrective examples. Cost: extra API calls and post-hoc confidence elicitation.

Problem Statement

LLMs still make reasoning errors and self-refinement can stagnate when a single model repeats its own mistakes. The paper asks: can a small group of different LLMs discuss, convince each other, and reach a better consensus to improve reasoning?

Main Contribution

RECONCILE: a practical multi-model, multi-agent pipeline that runs multi-round discussions, elicits agent confidences, shows corrective human examples, and produces a calibrated confidence-weighted team vote.

Extensive evaluation on seven benchmarks showing consistent accuracy gains over strong single-agent and multi-agent baselines; in some commonsense tasks RECONCILE outperforms GPT-4.

Ablations that isolate why it helps: model diversity, confidence estimation, and answer-rectifying examples (convincing samples) are each useful.

Key Findings

RECONCILE boosts team accuracy on Date Understanding by a large margin versus a leading multi-agent debate baseline.

Numbers: 75.3 → 86.7 (+11.4pp)

On StrategyQA, a RECONCILE team of mixed LLMs outperforms zero-shot GPT-4.

Numbers: GPT-4 75.6 → RECONCILE 79.0 (+3.4pp)

Diversity of model families matters: replacing multi-model agents with three instances of the same model drops accuracy.

Numbers: Multi-model 79.0 → same-model 72.2 (−6.8pp)

Providing a few answer-rectifying human examples ('convincing samples') improves team performance.

Numbers: With convincing samples 79.0 → without convincing samples 74.5 (−4.5pp)

RECONCILE is not uniformly superior: some math benchmarks favor GPT-4.

Numbers: GSM8K: GPT-4 90.7 → RECONCILE 85.3 (−5.4pp)

Results

Accuracy (StrategyQA)

Value: 79.0% (RECONCILE team: ChatGPT+Bard+Claude2)

Baseline: GPT-4 zero-shot 75.6%

Accuracy (Date Understanding)

Value: 86.7% (RECONCILE)

Baseline: Multi-agent debate baseline 75.3%

Accuracy

Value: 58.3% (RECONCILE with DeepSeekMath+Claude2+GPT-4)

Baseline: Best single-agent zero-shot 50.5%

Accuracy

Value: 57.7% (RECONCILE)

Baseline: Multi-agent debate 48.3%

Accuracy (GSM8K)

Value: 85.3% (RECONCILE)

Baseline: GPT-4 zero-shot 90.7%

Who Should Care

What To Try In 7 Days

Prototype a 2-round RECONCILE using two different LLM APIs and a small eval set (50–100 examples).

Add a prompt step that asks each model for a confidence number and use a simple calibrated weighted vote.

Collect 3–5 answer-rectifying examples for your domain and include them as 'convincing' demonstrations in prompts.
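The confidence-weighted vote in the second step can be prototyped in a few lines. The following is a minimal sketch, not the paper's implementation: the function names are hypothetical, and the bucket thresholds and weights in `CALIBRATION_BUCKETS` are illustrative placeholders that should be tuned on your own eval set.

```python
from collections import defaultdict

# ASSUMPTION: illustrative (lower bound, weight) buckets, not values taken from
# this summary. They damp self-reported confidences, which tend to be inflated.
CALIBRATION_BUCKETS = [(1.0, 1.0), (0.9, 0.8), (0.8, 0.5), (0.6, 0.3)]
FLOOR_WEIGHT = 0.1


def recalibrate(confidence):
    """Map a self-reported confidence in [0, 1] to a vote weight."""
    for lower_bound, weight in CALIBRATION_BUCKETS:
        if confidence >= lower_bound:
            return weight
    return FLOOR_WEIGHT


def weighted_vote(responses):
    """Pick the team answer from (answer, confidence) pairs, one per agent."""
    totals = defaultdict(float)
    for answer, confidence in responses:
        totals[answer] += recalibrate(confidence)
    # On a tie, max() returns the answer that was seen first.
    return max(totals, key=totals.get)


# One fully confident agent (weight 1.0) outweighs two hesitant ones (0.5 + 0.3).
team_answer = weighted_vote([("yes", 1.0), ("no", 0.85), ("no", 0.7)])
print(team_answer)
```

A plain majority vote would return "no" for the same inputs, so comparing the two on the eval set is a quick way to see whether the elicited confidences carry any signal.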

Agent Features

Memory

  • short-term discussion context across rounds

Planning

  • multi-round discussion (R rounds)
  • consensus termination when agents agree
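The round structure and early stop on consensus can be sketched as a small loop. This is a sketch under stated assumptions: an agent is modelled as any callable returning `(answer, explanation, confidence)`, each agent sees only the other agents' previous-round responses, and the two stub agents stand in for real LLM calls. All names are hypothetical.

```python
from typing import Callable, List, Tuple

# ASSUMPTION: an agent is a callable (question, peer_responses) -> Response.
Response = Tuple[str, str, float]  # (answer, explanation, confidence)
Agent = Callable[[str, List[Response]], Response]


def run_round_table(question: str, agents: List[Agent], max_rounds: int = 2) -> List[Response]:
    # Initial round: every agent answers independently.
    responses = [agent(question, []) for agent in agents]
    for _ in range(max_rounds):
        # Consensus termination: stop once all agents give the same answer.
        if len({answer for answer, _, _ in responses}) == 1:
            break
        # Discussion round: each agent revises after seeing its peers' responses.
        responses = [
            agent(question, responses[:i] + responses[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return responses


# Stub agents standing in for LLM API calls.
def stubborn(question, peers):
    return ("A", "because A", 0.9)


def follower(question, peers):
    if peers:
        return (peers[0][0], "convinced by a peer", 0.7)
    return ("B", "first guess", 0.6)


final = run_round_table("Which option?", [stubborn, follower])
print([answer for answer, _, _ in final])
```

The returned responses would then go to whatever voting rule the team uses; the loop itself only manages rounds and termination.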

Tool Use

  • API-based LLMs
  • local GPU inference for open-source agents

Frameworks

  • round-table multi-agent discussion
  • in-context learning with human corrective examples

Is Agentic

true

Architectures

  • Chat-based LLM APIs (GPT-family, PaLM, Claude)
  • Open-source LLMs (LLaMA-2-70B)
  • Domain-specialized small model (DeepSeekMath)

Collaboration

  • grouped display of all agents' answers
  • convincing-sample demonstrations
  • confidence-weighted voting
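One way to assemble a discussion prompt that groups answers and prepends convincing samples is sketched below. The prompt wording, the function name, and the tuple layout are assumptions made for illustration; they are not the paper's templates.

```python
from collections import defaultdict


def build_discussion_prompt(question, responses, convincing_samples=()):
    """Build a discussion-round prompt.

    responses: list of (agent_name, answer, explanation, confidence) tuples.
    convincing_samples: short human-written corrective examples (plain strings).
    """
    # Group agents by the answer they gave, so agreement is visible at a glance.
    grouped = defaultdict(list)
    for name, answer, explanation, confidence in responses:
        grouped[answer].append((name, explanation, confidence))

    lines = []
    for sample in convincing_samples:
        lines.append(f"Example of a corrected answer:\n{sample}\n")
    lines.append(f"Question: {question}\n")
    for answer, supporters in grouped.items():
        lines.append(f"{len(supporters)} agent(s) answered: {answer}")
        for name, explanation, confidence in supporters:
            lines.append(f"  - {name} (confidence {confidence:.2f}): {explanation}")
    lines.append("")
    lines.append(
        "Reconsider your answer. Reply with an answer, a step-by-step "
        "explanation, and a confidence between 0 and 1."
    )
    return "\n".join(lines)


prompt = build_discussion_prompt(
    "Could a llama fit in a phone booth?",
    [
        ("ChatGPT", "no", "Llamas are too tall.", 0.9),
        ("Bard", "yes", "A young llama is small.", 0.6),
        ("Claude2", "no", "An adult llama is too large.", 0.8),
    ],
    convincing_samples=["Q: ... A: ... (human explanation that fixed a wrong answer)"],
)
print(prompt)
```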

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on black-box API models whose training data and internal calibration are unknown.
  • Extra rounds increase latency and API cost.
  • Confidence is elicited post-hoc via prompts and can be noisy without recalibration.
  • Convincing samples help but require human explanations or curated corrective examples.

When Not To Use

  • Real-time applications with tight latency budgets.
  • Environments where API cost or call limits are prohibitive.
  • Tasks already dominated by a single highly specialized model.

Failure Modes

  • Echo chambers when the agent set lacks diversity (same-model agents repeat each other's errors).
  • An overconfident agent can dominate the weighted vote if calibration fails.
  • Poor or misleading convincing examples can steer the group to a wrong consensus.
  • The group may converge quickly on an incorrect majority opinion.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo-0613)
  • Bard (chat-bison-001)
  • Claude2
  • GPT-4
  • LLaMA-2-70B
  • DeepSeekMath

Metrics

  • Accuracy
  • BERTScore (for response diversity)
  • Expected Calibration Error (ECE)
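For reference, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses equal-width bins; the bin count and the half-open bin edges are assumptions, and the paper's exact settings are not stated in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Bins are [lo, hi), except the last one, which also includes 1.0.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty sequences of equal length")

    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(bucket) / n) * abs(accuracy - mean_confidence)
    return ece


# Four answers at 0.9 confidence, three of them right: a gap of |0.75 - 0.9|.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

A lower value means the reported confidences track actual accuracy more closely, which is what makes them useful as vote weights.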

Datasets

  • StrategyQA
  • CommonsenseQA
  • GSM8K
  • AQuA
  • MATH
  • Date Understanding (BIG-bench)
  • ANLI

Benchmarks

  • StrategyQA
  • CommonsenseQA
  • GSM8K
  • AQuA
  • MATH
  • Date Understanding
  • ANLI