Overview
The method is practical and reproducible with available LLM APIs; it reliably improves accuracy on many benchmarks but adds API cost and depends on post-hoc confidence elicitation and some human examples.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.
Who Should Care
Summary TLDR
RECONCILE runs a short multi-round "round-table" discussion between different LLMs (e.g., ChatGPT, Bard, Claude2). Each agent provides an answer, a step-by-step explanation, and a confidence score. Agents see grouped answers, other agents' explanations, and a few human "convincing" examples. They then iteratively revise their answers, and a calibrated confidence-weighted vote picks the team answer. On seven reasoning benchmarks, RECONCILE raises accuracy over single-agent and prior multi-agent baselines (up to +11.4pp on Date Understanding) and even beats GPT-4 on some commonsense tasks. Gains rely on model diversity, confidence calibration, and a few corrective examples. Cost: extra API calls and post-hoc confid‑s
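The discussion loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Response` and `Agent` types, the stop-on-agreement rule, and the `make_stub` agents are all assumptions introduced here, and the stubs stand in for real LLM API calls. The vote uses raw confidences, with no calibration step.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Response:
    agent: str
    answer: str
    explanation: str
    confidence: float  # self-reported, in [0, 1]


# Hypothetical interface: an agent maps (question, previous responses) to a Response.
Agent = Callable[[str, List[Response]], Response]


def weighted_vote(responses: List[Response]) -> str:
    """Sum confidences per answer and return the highest-scoring answer."""
    scores: Dict[str, float] = defaultdict(float)
    for r in responses:
        scores[r.answer] += r.confidence
    return max(scores, key=scores.get)


def round_table(question: str, agents: List[Agent], max_rounds: int = 2) -> str:
    # Initial round: every agent answers independently, with no peer context.
    responses = [agent(question, []) for agent in agents]
    for _ in range(max_rounds):
        if len({r.answer for r in responses}) == 1:
            break  # assumed stopping rule: stop once all agents agree
        # Each agent sees the previous round's responses and may revise.
        responses = [agent(question, responses) for agent in agents]
    return weighted_vote(responses)


def make_stub(name: str, first_answer: str, confidence: float, stubborn: bool) -> Agent:
    """Deterministic stand-in for an LLM call, for illustration only."""
    def agent(question: str, peers: List[Response]) -> Response:
        if not peers or stubborn:
            return Response(name, first_answer, "stub reasoning", confidence)
        # Non-stubborn stubs adopt the most confident peer's answer.
        best = max(peers, key=lambda r: r.confidence)
        return Response(name, best.answer, "convinced by " + best.agent, confidence)
    return agent


agents = [
    make_stub("agent_a", "yes", 0.9, stubborn=True),
    make_stub("agent_b", "no", 0.6, stubborn=False),
    make_stub("agent_c", "no", 0.5, stubborn=False),
]
print(round_table("Is the claim true?", agents, max_rounds=0))  # no
print(round_table("Is the claim true?", agents))                # yes
```

With no discussion rounds the two lower-confidence stubs win the vote; after one round they adopt the confident stub's answer and the team answer flips. Whether such a flip is an improvement depends on how well confidence tracks correctness, which is why the summary stresses calibration.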
Problem Statement
LLMs still make reasoning errors and self-refinement can stagnate when a single model repeats its own mistakes. The paper asks: can a small group of different LLMs discuss, convince each other, and reach a better consensus to improve reasoning?
Main Contribution
RECONCILE: a practical multi-model, multi-agent pipeline that runs multi-round discussions, elicits agent confidences, shows corrective human examples, and produces a calibrated confidence-weighted team vote.
Extensive evaluation on seven benchmarks showing consistent accuracy gains over strong single-agent and multi-agent baselines; in some commonsense tasks RECONCILE outperforms GPT-4.
Key Findings
RECONCILE raises team accuracy on Date Understanding to 86.7%, versus 75.3% for a leading multi-agent debate baseline (+11.4pp).
On StrategyQA, a RECONCILE team of mixed LLMs (ChatGPT+Bard+Claude2) reaches 79.0%, outperforming zero-shot GPT-4 at 75.6% (+3.4pp).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 79.0% (RECONCILE team: ChatGPT+Bard+Claude2) | GPT-4 zero-shot 75.6% | +3.4pp | StrategyQA (100-sample subset) | Table 2; §6.1 | Table 2; §6.1 |
| Accuracy | 86.7% (RECONCILE) | Multi-agent debate baseline 75.3% | +11.4pp | Date Understanding (100-sample subset) | Table 2; §6.1 | Table 2; §6.1 |
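The Delta column is the difference in percentage points between the RECONCILE accuracy and the baseline accuracy. A short check, using only the figures in the table above (the dictionary layout and the `delta_pp` helper are illustrative names):

```python
# Accuracy figures copied from the Results table, in percent.
results = {
    "StrategyQA": {"reconcile": 79.0, "baseline": 75.6},
    "Date Understanding": {"reconcile": 86.7, "baseline": 75.3},
}


def delta_pp(row: dict) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(row["reconcile"] - row["baseline"], 1)


for name, row in results.items():
    print(f"{name}: {delta_pp(row):+.1f}pp")
```

Both rows reproduce the reported deltas (+3.4pp and +11.4pp). Note that these are absolute differences, not relative improvements.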
What To Try In 7 Days
Prototype a 2-round RECONCILE using two different LLM APIs and a small eval set (50–100 examples).
Add a prompt step that asks each model for a confidence number and use a simple calibrated weighted vote.
Collect 3–5 answer-rectifying examples for your domain and include them as 'convincing' demonstrations in prompts.
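For the prototype steps above, the discussion prompt might be assembled roughly as follows. This is a sketch under stated assumptions: the prompt wording, the `PeerResponse` tuple layout, and the `build_discussion_prompt` function are hypothetical and are not taken from the paper's prompts.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (agent name, answer, explanation, confidence in [0, 1])
PeerResponse = Tuple[str, str, str, float]


def build_discussion_prompt(
    question: str,
    peers: List[PeerResponse],
    convincing_examples: List[str],
) -> str:
    # Group peer responses by answer so the model sees how much support each has.
    grouped: Dict[str, List[PeerResponse]] = defaultdict(list)
    for peer in peers:
        grouped[peer[1]].append(peer)

    lines: List[str] = []
    if convincing_examples:
        lines.append("Examples where an explanation corrected a wrong answer:")
        lines.extend(f"- {example}" for example in convincing_examples)
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("")
    lines.append("Answers from other agents, grouped:")
    for answer, members in grouped.items():
        lines.append(f"Answer '{answer}' ({len(members)} agent(s)):")
        for name, _, explanation, confidence in members:
            lines.append(f"  {name} (confidence {confidence:.2f}): {explanation}")
    lines.append("")
    lines.append(
        "Reconsider your answer. Reply with your final answer, a step-by-step "
        "explanation, and a confidence between 0 and 1."
    )
    return "\n".join(lines)


prompt = build_discussion_prompt(
    "Is the claim true?",
    [("model_a", "yes", "Step 1 ...", 0.9), ("model_b", "no", "Step 1 ...", 0.6)],
    ["Q: ... Initial answer: no. Explanation that changed it: ... Final answer: yes."],
)
print(prompt)
```

Asking for the answer, explanation, and confidence in one fixed reply format keeps parsing simple when the same prompt is sent to different LLM APIs.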
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
Relies on black-box API models whose training data and internal calibration are unknown.
Extra rounds increase latency and API cost.
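A rough way to size the cost limitation is to count model calls per question. The sketch below assumes one initial answer per agent plus one revision per agent per discussion round, and a flat price per call; both are simplifying assumptions made here, and `cost_per_call` is a hypothetical figure you would replace with your provider's pricing.

```python
def max_api_calls(n_agents: int, max_rounds: int) -> int:
    """Upper bound per question: initial answer plus one revision per round, per agent."""
    return n_agents * (1 + max_rounds)


def estimated_cost(n_agents: int, max_rounds: int, n_questions: int, cost_per_call: float) -> float:
    """Worst-case spend, assuming a flat (hypothetical) price per call."""
    return max_api_calls(n_agents, max_rounds) * n_questions * cost_per_call


print(max_api_calls(n_agents=3, max_rounds=2))  # 9 calls per question
print(max_api_calls(n_agents=1, max_rounds=0))  # 1 call for a single-agent baseline
```

The flat price understates later rounds, since those prompts also carry the other agents' explanations and are therefore longer. Stopping early once agents agree lowers the average below this upper bound.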
When Not To Use
Real-time applications with tight latency budgets.
Environments where API cost or call limits are prohibitive.
Failure Modes
Echo chambers when the agent set lacks diversity (same-model agents repeat each other's errors).
An overconfident agent can dominate the weighted vote if calibration fails.
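The overconfidence failure mode is easy to reproduce with a toy vote. In the sketch below, `clamp_weight` is an illustrative safeguard introduced here, not the paper's calibration scheme, and its `floor` and `cap` values are assumptions. Because `2 * floor > cap`, two agreeing agents always outweigh one dissenting agent, however confident it claims to be.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Vote = Tuple[str, float]  # (answer, self-reported confidence in [0, 1])


def clamp_weight(confidence: float, floor: float = 0.4, cap: float = 0.75) -> float:
    """Hypothetical safeguard: bound each agent's voting weight to [floor, cap]."""
    return min(max(confidence, floor), cap)


def vote(votes: List[Vote], weight: Callable[[float], float] = lambda c: c) -> str:
    """Sum (optionally transformed) confidences per answer; return the top answer."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, confidence in votes:
        scores[answer] += weight(confidence)
    return max(scores, key=scores.get)


votes = [("A", 1.0), ("B", 0.45), ("B", 0.5)]
print(vote(votes))                # A: raw weights give A=1.0 vs B=0.95
print(vote(votes, clamp_weight))  # B: clamped weights give A=0.75 vs B=0.95
```

Clamping trades away some of the signal in genuinely well-founded high confidence, so it is worth comparing against the unclamped vote on a labeled evaluation set before adopting it.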

