Make LLMs argue: multi-model round-table + confidence-weighted voting improves reasoning

Overview

Decision Snapshot: Needs Validation

The method is practical and reproducible with available LLM APIs; it reliably improves accuracy on many benchmarks but adds API cost and depends on post-hoc confidence elicitation and some human examples.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal

Why It Matters For Business

Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

RECONCILE runs a short multi-round "round-table" discussion among different LLMs (e.g., ChatGPT, Bard, Claude2). Each agent provides an answer, a step-by-step explanation, and a confidence score. Agents then see the grouped answers, the other agents' explanations, and a few human "convincing" examples, and iteratively revise their responses; a calibrated confidence-weighted vote picks the team answer. On seven reasoning benchmarks, RECONCILE raises accuracy over single-agent and prior multi-agent baselines (up to +11.4pp on Date Understanding) and even beats GPT-4 on some commonsense tasks. Gains rely on model diversity, confidence calibration, and a few corrective examples. Cost: extra API calls and post-hoc confidence elicitation.

Problem Statement

LLMs still make reasoning errors and self-refinement can stagnate when a single model repeats its own mistakes. The paper asks: can a small group of different LLMs discuss, convince each other, and reach a better consensus to improve reasoning?

Main Contribution

RECONCILE: a practical multi-model, multi-agent pipeline that runs multi-round discussions, elicits agent confidences, shows corrective human examples, and produces a calibrated confidence-weighted team vote.

Extensive evaluation on seven benchmarks showing consistent accuracy gains over strong single-agent and multi-agent baselines; in some commonsense tasks RECONCILE outperforms GPT-4.
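
The discussion loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Response`, `Agent`, `round_table`, and the two stub agents are hypothetical names, the agents are plain callables standing in for real LLM API calls, and the vote uses raw self-reported confidences rather than recalibrated ones.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Response:
    answer: str
    explanation: str
    confidence: float  # self-reported, assumed to lie in [0, 1]


# Hypothetical agent interface: (question, peers' last responses) -> Response.
Agent = Callable[[str, List[Response]], Response]


def weighted_vote(responses: List[Response]) -> str:
    """Sum confidences per answer and return the highest-scoring answer."""
    scores: Counter = Counter()
    for r in responses:
        scores[r.answer] += r.confidence
    return scores.most_common(1)[0][0]


def round_table(question: str, agents: List[Agent], max_rounds: int = 2) -> str:
    # Initial round: every agent answers without seeing its peers.
    responses = [agent(question, []) for agent in agents]
    for _ in range(max_rounds):
        # Stop early once all agents agree (consensus termination).
        if len({r.answer for r in responses}) == 1:
            break
        # Each agent revises after seeing every other agent's last response.
        responses = [
            agent(question, responses[:i] + responses[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return weighted_vote(responses)


def stubborn(answer: str, confidence: float) -> Agent:
    """Stub agent that never changes its answer."""
    def agent(question: str, peers: List[Response]) -> Response:
        return Response(answer, "fixed belief", confidence)
    return agent


def persuadable(initial: str, confidence: float) -> Agent:
    """Stub agent that adopts the most confident peer's answer."""
    def agent(question: str, peers: List[Response]) -> Response:
        if not peers:
            return Response(initial, "first guess", confidence)
        best = max(peers, key=lambda r: r.confidence)
        return Response(best.answer, "convinced by a peer", confidence)
    return agent


team = [stubborn("yes", 0.9), persuadable("no", 0.6), persuadable("no", 0.6)]
print(round_table("Is 2024 a leap year?", team, max_rounds=0))  # vote without discussion
print(round_table("Is 2024 a leap year?", team, max_rounds=2))  # vote after discussion
```

With these stubs, voting without any discussion picks "no" (two low-confidence votes outweigh one confident vote), while one discussion round moves the team to "yes". In a real prototype each stub would be replaced by a call to a different LLM provider.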

Key Findings

RECONCILE boosts team accuracy on Date Understanding by a large margin versus a leading multi-agent debate baseline.

Numbers: 75.3 → 86.7 (+11.4pp)

Practical Use: Use a small multi-model discussion loop to substantially improve accuracy on some reasoning tasks.

Evidence Ref: Table 2; §6.1

On StrategyQA, a RECONCILE team of mixed LLMs outperforms zero-shot GPT-4.

Numbers: GPT-4 75.6 → RECONCILE 79.0 (+3.4pp)

Practical Use: Combining diverse models can beat a stronger single model on some commonsense reasoning workloads.

Evidence Ref: Table 2; §6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	79.0% (RECONCILE team: ChatGPT+Bard+Claude2)	GPT-4 zero-shot 75.6%	+3.4pp	StrategyQA (100-sample subset)	Table 2; §6.1	Table 2; §6.1
Accuracy	86.7% (RECONCILE)	Multi-agent debate baseline 75.3%	+11.4pp	Date Understanding (100-sample subset)	Table 2; §6.1	Table 2; §6.1

What To Try In 7 Days

Prototype a 2-round RECONCILE using two different LLM APIs and a small eval set (50–100 examples).

Add a prompt step that asks each model for a confidence number and use a simple calibrated weighted vote.

Collect 3–5 answer-rectifying examples for your domain and include them as 'convincing' demonstrations in prompts.
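
For the confidence step, a simple calibrated weighted vote might look like the sketch below. The bucket thresholds and weights in `recalibrate` are illustrative assumptions, not values taken from this summary, and should be tuned on a held-out set; `recalibrate` and `calibrated_vote` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def recalibrate(p: float) -> float:
    """Map a self-reported confidence to a vote weight.

    The buckets are assumptions chosen to damp overconfident reports.
    """
    if p >= 1.0:
        return 1.0
    if p >= 0.9:
        return 0.8
    if p >= 0.8:
        return 0.5
    if p > 0.6:
        return 0.3
    return 0.1


def calibrated_vote(votes: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Return the winning answer and the per-answer weight totals.

    `votes` holds (answer, self-reported confidence) pairs, one per agent.
    Ties go to the answer that appeared first in `votes`.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += recalibrate(confidence)
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


winner, totals = calibrated_vote([("A", 0.95), ("B", 0.7), ("B", 0.7)])
print(winner, totals)
```

In this example the raw confidences would favor B (0.7 + 0.7 against 0.95), while the recalibrated weights favor A (0.8 against 0.3 + 0.3). Whether that shift is desirable depends on how well each model's stated confidence tracks its accuracy on your eval set.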

Agent Features

Memory

short-term discussion context across rounds

Planning

multi-round discussion (R rounds)

consensus termination when agents agree

Tool Use

API-based LLMs

local GPU inference for open-source agents

Frameworks

round-table multi-agent discussion

in-context learning with human corrective examples

Is Agentic

Yes

Architectures

Chat-based LLM APIs (GPT-family, PaLM, Claude)

Open-source LLMs (LLaMA-2-70B)

Domain-specialized small model (DeepSeekMath)

Collaboration

grouped-display of all agents' answers

convincing-sample demonstrations

confidence-weighted voting
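
One way to assemble a discussion prompt from these collaboration features is sketched below. The prompt wording, the `AgentOutput` type, and both function names are hypothetical and chosen for illustration; the actual prompts are in the linked repository and may differ.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class AgentOutput(NamedTuple):
    agent: str
    answer: str
    explanation: str
    confidence: float


def group_by_answer(outputs: List[AgentOutput]) -> Dict[str, List[AgentOutput]]:
    """Bucket agent outputs by their answer, keeping first-seen order."""
    groups: Dict[str, List[AgentOutput]] = defaultdict(list)
    for out in outputs:
        groups[out.answer].append(out)
    return dict(groups)


def build_discussion_prompt(
    question: str,
    outputs: List[AgentOutput],
    convincing_examples: List[str],
) -> str:
    """Build one prompt with corrective examples and grouped peer answers."""
    lines: List[str] = []
    if convincing_examples:
        lines.append("Examples of explanations that corrected a wrong answer:")
        lines.extend(f"- {ex}" for ex in convincing_examples)
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answers from the other agents, grouped by answer:")
    for answer, members in group_by_answer(outputs).items():
        lines.append(f"{len(members)} agent(s) answered '{answer}':")
        for m in members:
            lines.append(f"  - {m.agent} (confidence {m.confidence:.2f}): {m.explanation}")
    lines.append(
        "Reconsider your answer. Reply with an answer, a step-by-step "
        "explanation, and a confidence between 0 and 1."
    )
    return "\n".join(lines)


outputs = [
    AgentOutput("agent_1", "yes", "2024 is divisible by 4 and not by 100.", 0.9),
    AgentOutput("agent_2", "no", "Guessed from memory.", 0.6),
    AgentOutput("agent_3", "yes", "Leap years recur every four years.", 0.8),
]
prompt = build_discussion_prompt(
    "Is 2024 a leap year?",
    outputs,
    ["A year divisible by 4 but not by 100 is a leap year."],
)
print(prompt)
```

Grouping lets each agent see at a glance how many peers support each answer before reading the individual explanations.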

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/dinobby/ReConcile

Risks & Boundaries

Limitations

Relies on black-box API models whose training data and internal calibration are unknown.

Extra rounds increase latency and API cost.

When Not To Use

Real-time applications with tight latency budgets.

Environments where API cost or call limits are prohibitive.

Failure Modes

Echo chambers when agent set lacks diversity (same-model agents repeat errors).

Overconfident agent can dominate weighted vote if calibration fails.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0613)

Bard (chat-bison-001)

Claude2

GPT-4

LLaMA-2-70B

DeepSeekMath

Metrics

Accuracy

BERTScore (for response diversity)

Expected Calibration Error (ECE)
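
Expected Calibration Error measures the gap between stated confidence and observed accuracy, averaged over confidence bins and weighted by bin size. The sketch below uses equal-width bins; the function name, the default of 10 bins, and the bin-edge convention are assumptions and may not match the paper's exact setup.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Bin i covers [i/n_bins, (i+1)/n_bins); a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))

    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# Four answers reported at 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means the stated confidences are closer to observed accuracy, which is what a confidence-weighted vote needs in order to avoid rewarding overconfident agents.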

Datasets

StrategyQA

CommonsenseQA

GSM8K

AQuA

MATH

Date Understanding (BIG-bench)

ANLI

Benchmarks

StrategyQA

CommonsenseQA

GSM8K

AQuA

MATH

Date Understanding

ANLI
