Make LLMs argue: multi-model round-table + confidence-weighted voting improves reasoning

September 22, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical and reproducible with available LLM APIs; it reliably improves accuracy on many benchmarks but adds API cost and depends on post-hoc confidence elicitation and some human examples.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal

Links

Abstract / PDF / Code

Why It Matters For Business

Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.

Summary TLDR

RECONCILE runs a short multi-round "round-table" discussion between different LLMs (e.g., ChatGPT, Bard, Claude2). Each agent provides an answer, a step-by-step explanation, and a confidence score. Agents then see the grouped answers, the other agents' explanations, and a few human "convincing" examples, and revise their responses over several rounds. A calibrated confidence-weighted vote picks the team answer. On seven reasoning benchmarks, RECONCILE raises accuracy over single-agent and prior multi-agent baselines (up to +11.4pp on Date Understanding) and even beats GPT-4 on some commonsense tasks. Gains rely on model diversity, confidence calibration, and a few corrective examples. Cost: extra API calls and post-hoc confid‑s

Problem Statement

LLMs still make reasoning errors and self-refinement can stagnate when a single model repeats its own mistakes. The paper asks: can a small group of different LLMs discuss, convince each other, and reach a better consensus to improve reasoning?

Main Contribution

RECONCILE: a practical multi-model, multi-agent pipeline that runs multi-round discussions, elicits agent confidences, shows corrective human examples, and produces a calibrated confidence-weighted team vote.

Extensive evaluation on seven benchmarks showing consistent accuracy gains over strong single-agent and multi-agent baselines; in some commonsense tasks RECONCILE outperforms GPT-4.
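The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Response`, `round_table`, `weighted_vote`, and the two stub agents are hypothetical names, the agents are plain Python callables standing in for real LLM API calls, and the vote uses raw confidences with no calibration step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Response:
    agent: str
    answer: str
    explanation: str
    confidence: float  # self-reported, assumed to lie in [0, 1]


# Hypothetical agent interface: (question, previous-round responses) -> Response.
Agent = Callable[[str, List[Response]], Response]


def weighted_vote(responses: List[Response]) -> str:
    """Sum confidences per answer and return the answer with the largest total."""
    totals = {}
    for r in responses:
        totals[r.answer] = totals.get(r.answer, 0.0) + r.confidence
    return max(totals, key=totals.get)


def round_table(question: str, agents: List[Agent], max_rounds: int = 3) -> str:
    """Collect initial answers, discuss for up to max_rounds, then vote."""
    responses = [agent(question, []) for agent in agents]  # initial round
    for _ in range(max_rounds):
        if len({r.answer for r in responses}) == 1:
            break  # consensus reached, stop early
        # Every agent sees all responses from the previous round and may revise.
        responses = [agent(question, responses) for agent in agents]
    return weighted_vote(responses)


def stubborn(name: str, answer: str, conf: float) -> Agent:
    """Stub agent that never changes its answer."""
    def agent(question: str, previous: List[Response]) -> Response:
        return Response(name, answer, "fixed belief", conf)
    return agent


def follower(name: str, first_answer: str, conf: float) -> Agent:
    """Stub agent that adopts the most confident answer from the previous round."""
    def agent(question: str, previous: List[Response]) -> Response:
        if not previous:
            return Response(name, first_answer, "initial guess", conf)
        best = max(previous, key=lambda r: r.confidence)
        return Response(name, best.answer, f"persuaded by {best.agent}", conf)
    return agent


team = [stubborn("a", "yes", 0.9), follower("b", "no", 0.6), follower("c", "no", 0.5)]
no_discussion = round_table("Is 7 prime?", team, max_rounds=0)
with_discussion = round_table("Is 7 prime?", team, max_rounds=2)
```

With these stubs, voting without discussion favors the two less confident agents, while one discussion round brings the team to consensus on the confident agent's answer. Real agents would replace the stubs with API calls that build a prompt from the previous round.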

Key Findings

RECONCILE boosts team accuracy on Date Understanding by a large margin versus a leading multi-agent debate baseline.

Numbers: 75.3 → 86.7 (+11.4pp)

Practical Use: Use a small multi-model discussion loop to substantially improve accuracy on some reasoning tasks.

Evidence Ref: Table 2; §6.1

On StrategyQA, a RECONCILE team of mixed LLMs outperforms zero-shot GPT-4.

Numbers: GPT-4 75.6 → RECONCILE 79.0 (+3.4pp)

Practical Use: Combining diverse models can beat a stronger single model on some commonsense reasoning workloads.

Evidence Ref: Table 2; §6.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 79.0% (RECONCILE team: ChatGPT+Bard+Claude2) | GPT-4 zero-shot 75.6% | +3.4pp | StrategyQA (100-sample subset) | Table 2; §6.1 | Table 2; §6.1
Accuracy | 86.7% (RECONCILE) | Multi-agent debate baseline 75.3% | +11.4pp | Date Understanding (100-sample subset) | Table 2; §6.1 | Table 2; §6.1

What To Try In 7 Days

Prototype a 2-round RECONCILE using two different LLM APIs and a small eval set (50–100 examples).

Add a prompt step that asks each model for a confidence number and use a simple calibrated weighted vote.

Collect 3–5 answer-rectifying examples for your domain and include them as 'convincing' demonstrations in prompts.
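For the confidence step, one possible starting point is sketched below. It assumes the model is asked to end its reply with a line such as `Confidence: 0.85`. The prompt suffix, the function names, and in particular the step thresholds in `calibrate` are illustrative assumptions, not values taken from this summary. Tune them on your own eval set.

```python
import re
from collections import defaultdict

# Hypothetical instruction appended to each agent prompt.
CONFIDENCE_SUFFIX = (
    "After your answer, add a line of the form "
    "'Confidence: <number between 0 and 1>'."
)


def parse_confidence(text: str, default: float = 0.5) -> float:
    """Extract a self-reported confidence; fall back to `default` if absent."""
    match = re.search(r"confidence\s*[:=]\s*([0-9]*\.?[0-9]+)", text, re.IGNORECASE)
    if not match:
        return default
    value = float(match.group(1))
    if value > 1.0:  # treat values such as 85 as percentages
        value /= 100.0
    return min(max(value, 0.0), 1.0)


def calibrate(p: float) -> float:
    """Step-wise rescaling that discounts all but the highest confidences.

    The thresholds are illustrative assumptions.
    """
    if p >= 1.0:
        return 1.0
    if p >= 0.9:
        return 0.8
    if p >= 0.8:
        return 0.5
    if p > 0.6:
        return 0.3
    return 0.1


def weighted_vote(votes):
    """votes: iterable of (answer, raw_confidence). Returns the winning answer."""
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += calibrate(confidence)
    return max(totals, key=totals.get)


votes = [("A", 0.95), ("B", 0.7), ("B", 0.7)]
winner = weighted_vote(votes)
```

In this example the single high-confidence vote for `A` outweighs two moderate votes for `B` after rescaling, whereas a plain sum of raw confidences would pick `B`. Whether that behavior is desirable depends on how well the models' stated confidences track their actual accuracy.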

Agent Features

Memory
short-term discussion context across rounds
Planning
multi-round discussion (R rounds); consensus termination when agents agree
Tool Use
API-based LLMs; local GPU inference for open-source agents
Frameworks
round-table multi-agent discussion; in-context learning with human corrective examples
Is Agentic

Yes

Architectures
Chat-based LLM APIs (GPT-family, PaLM, Claude); Open-source LLMs (LLaMA-2-70B); Domain-specialized small model (DeepSeekMath)
Collaboration
grouped display of all agents' answers; convincing-sample demonstrations; confidence-weighted voting
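The grouped display and convincing-sample demonstrations listed under Collaboration could be assembled into a discussion prompt along these lines. This is a sketch under assumptions: the dictionary keys, the function names, and the prompt wording are hypothetical and are not the prompts used in the paper.

```python
from collections import defaultdict


def group_answers(responses):
    """Group responses (dicts with 'agent', 'answer', 'explanation') by answer."""
    groups = defaultdict(list)
    for r in responses:
        groups[r["answer"]].append(r)
    return dict(groups)


def build_discussion_prompt(question, responses, convincing_samples):
    """Build the text shown to an agent before it revises its answer."""
    lines = []
    if convincing_samples:
        lines.append("Examples where an explanation corrected a wrong answer:")
        for s in convincing_samples:
            lines.append(f"Q: {s['question']}")
            lines.append(f"Corrected answer: {s['answer']}")
            lines.append(f"Explanation: {s['explanation']}")
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answers from the previous round, grouped:")
    # Largest group first, so the majority view is visible at a glance.
    ordered = sorted(group_answers(responses).items(), key=lambda kv: -len(kv[1]))
    for answer, members in ordered:
        lines.append(f"- {len(members)} agent(s) answered '{answer}':")
        for m in members:
            lines.append(f"    {m['agent']}: {m['explanation']}")
    lines.append(
        "Reconsider your answer. Reply with an answer, a step-by-step "
        "explanation, and a confidence between 0 and 1."
    )
    return "\n".join(lines)


responses = [
    {"agent": "model_1", "answer": "yes", "explanation": "7 has no divisors besides 1 and 7."},
    {"agent": "model_2", "answer": "no", "explanation": "I assumed 7 is divisible by 3."},
    {"agent": "model_3", "answer": "yes", "explanation": "Checked 2, 3, and 5; none divide 7."},
]
samples = [
    {"question": "Is 9 prime?", "answer": "no", "explanation": "9 = 3 x 3."},
]
prompt = build_discussion_prompt("Is 7 prime?", responses, samples)
```

Grouping by answer keeps the prompt compact when several agents agree, and each explanation stays attributed to its agent so the reader of the prompt can weigh the arguments separately.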

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on black-box API models whose training data and internal calibration are unknown.

Extra rounds increase latency and API cost.

When Not To Use

Real-time applications with tight latency budgets.

Environments where API cost or call limits are prohibitive.

Failure Modes

Echo chambers when the agent set lacks diversity (same-model agents repeat each other's errors).

An overconfident agent can dominate the weighted vote if calibration fails.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0613); Bard (chat-bison-001); Claude2; GPT-4; LLaMA-2-70B; DeepSeekMath

Metrics

Accuracy; BERTScore (for response diversity); Expected Calibration Error (ECE)
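For readers who want to check calibration on their own eval set, the standard equal-width-bin form of ECE can be computed as sketched below. The function name, the default of 10 bins, and the half-open bin convention are assumptions. The summary does not say which binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Bins are equal-width and half-open, [k/n_bins, (k+1)/n_bins), with 1.0
    placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for p, is_correct in zip(confidences, correct):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_confidence)
    return ece


# Four answers, each given with 0.9 confidence, three of them correct:
# accuracy is 0.75, so the gap to the stated confidence is 0.15.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means stated confidences track observed accuracy more closely. Comparing ECE before and after any confidence rescaling is one way to check whether the rescaling actually helps.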

Datasets

StrategyQA; CommonsenseQA; GSM8K; AQuA; MATH; Date Understanding (BIG-bench); ANLI

Benchmarks

StrategyQA; CommonsenseQA; GSM8K; AQuA; MATH; Date Understanding; ANLI