Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.4
Citation Count
4
Why It Matters For Business
Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.
Summary TLDR
RECONCILE runs a short multi-round "round-table" discussion between different LLMs (e.g., ChatGPT, Bard, Claude2). Each agent provides an answer, a step-by-step explanation, and a confidence score. Agents then see the answers grouped together, the other agents' explanations, and a few human "convincing" examples. They revise their answers over successive rounds, and a calibrated confidence-weighted vote picks the team answer. On seven reasoning benchmarks, RECONCILE raises accuracy vs single-agent and prior multi-agent baselines (up to +11.4pp on Date Understanding) and even beats GPT-4 on some commonsense tasks. Gains rely on model diversity, confidence calibration, and a few corrective examples. Cost: extra API calls and post-hoc confid‑s
Problem Statement
LLMs still make reasoning errors, and self-refinement can stagnate when a single model repeats its own mistakes. The paper asks: can a small group of different LLMs discuss, convince each other, and reach a better consensus to improve reasoning?
Main Contribution
RECONCILE: a practical multi-model, multi-agent pipeline that runs multi-round discussions, elicits agent confidences, shows corrective human examples, and produces a calibrated confidence-weighted team vote.
Extensive evaluation on seven benchmarks showing consistent accuracy gains over strong single-agent and multi-agent baselines; on some commonsense tasks RECONCILE outperforms GPT-4.
Ablations that isolate why it helps: model diversity, confidence estimation, and answer-rectifying examples (convincing samples) are each useful.
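The confidence-weighted team vote can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `recalibrate` and `weighted_vote` are hypothetical, and the step thresholds that turn a self-reported confidence into a vote weight are illustrative assumptions that should be replaced with a mapping validated on your own data.

```python
from collections import defaultdict


def recalibrate(confidence: float) -> float:
    """Map a self-reported confidence in [0, 1] to a vote weight.

    The thresholds below are illustrative assumptions, not values
    taken from the source.
    """
    if confidence >= 1.0:
        return 1.0
    if confidence >= 0.9:
        return 0.8
    if confidence >= 0.8:
        return 0.5
    if confidence > 0.6:
        return 0.3
    return 0.1


def weighted_vote(responses: list[tuple[str, float]]) -> str:
    """Return the answer with the largest total recalibrated weight."""
    totals: dict[str, float] = defaultdict(float)
    for answer, confidence in responses:
        totals[answer] += recalibrate(confidence)
    # On a tie, max() returns the answer that was seen first.
    return max(totals, key=totals.get)


# One confident "yes" (weight 0.8) outweighs two hesitant "no" votes (0.3 + 0.3).
team_answer = weighted_vote([("yes", 0.95), ("no", 0.7), ("no", 0.65)])
```

Down-weighting mid-range confidences is one way to keep a mildly confident majority from automatically overriding a single well-calibrated agent; whether that trade-off helps depends on how well each model's stated confidence tracks its accuracy.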
Key Findings
RECONCILE boosts team accuracy on Date Understanding by a large margin versus a leading multi-agent debate baseline.
On StrategyQA, a RECONCILE team of mixed LLMs outperforms zero-shot GPT-4.
Diversity of model families matters: replacing multi-model agents with three instances of the same model drops accuracy.
Providing a few answer-rectifying human examples ('convincing samples') improves team performance.
RECONCILE is not uniformly superior: some math benchmarks favor GPT-4.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Prototype a 2-round RECONCILE pipeline using two different LLM APIs and a small eval set (50–100 examples).
Add a prompt step that asks each model for a confidence number and use a simple calibrated weighted vote.
Collect 3–5 answer-rectifying examples for your domain and include them as 'convincing' demonstrations in prompts.
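For the confidence step, one option is to ask every model for a small JSON object and parse it defensively. The sketch below assumes a reply schema with `answer`, `explanation`, and `confidence` keys; the schema, the `PROMPT_SUFFIX` wording, and the `parse_reply` helper are all hypothetical and not taken from the paper.

```python
import json

# Hypothetical instruction appended to each agent's prompt.
PROMPT_SUFFIX = (
    'Reply with JSON only: {"answer": "...", "explanation": "...", '
    '"confidence": <number between 0 and 1>}'
)


def parse_reply(raw: str) -> dict:
    """Parse one model reply into answer, explanation, and confidence.

    Malformed replies fall back to an empty answer with confidence 0.0,
    so they carry as little weight as possible in a later vote.
    """
    try:
        data = json.loads(raw)
        answer = str(data["answer"]).strip()
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return {"answer": "", "explanation": "", "confidence": 0.0}
    # Clamp out-of-range values such as 1.7 or -0.2 into [0, 1].
    confidence = min(1.0, max(0.0, confidence))
    return {
        "answer": answer,
        "explanation": str(data.get("explanation", "")),
        "confidence": confidence,
    }
```

Treating unparseable replies as zero-confidence abstentions keeps a single formatting failure from stalling the round.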
Agent Features
Memory
- short-term discussion context across rounds
Planning
- multi-round discussion (R rounds)
- consensus termination when agents agree
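The round limit and the consensus stop can be sketched with stub agents. This is an assumption-laden illustration: the `Agent` callable signature, the `discuss` function, and the two toy agents are hypothetical stand-ins for real LLM calls.

```python
from typing import Callable

# An agent is any callable (question, peer_answers) -> answer.
# In a real system this would wrap an LLM API call.
Agent = Callable[[str, list[str]], str]


def discuss(
    question: str, agents: list[Agent], max_rounds: int = 3
) -> tuple[list[str], int]:
    """Run up to max_rounds of revision, stopping once all answers match.

    Returns the final answers and the number of discussion rounds used.
    """
    # Initial round: every agent answers without seeing any peers.
    answers = [agent(question, []) for agent in agents]
    rounds_used = 0
    while rounds_used < max_rounds and len(set(answers)) > 1:
        # Each agent sees the previous round's answers from the other agents.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        rounds_used += 1
    return answers, rounds_used


def stubborn(question: str, peers: list[str]) -> str:
    """Toy agent that never changes its answer."""
    return "B"


def follower(question: str, peers: list[str]) -> str:
    """Toy agent that adopts the first peer answer it sees."""
    return peers[0] if peers else "A"


final_answers, rounds = discuss("q", [stubborn, follower, follower])
```

With these stubs the team converges on "B" after one discussion round, which also illustrates how quickly a group can settle on whichever agent refuses to move.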
Tool Use
- API-based LLMs
- local GPU inference for open-source agents
Frameworks
- round-table multi-agent discussion
- in-context learning with human corrective examples
Is Agentic
true
Architectures
- Chat-based LLM APIs (GPT-family, PaLM, Claude)
- Open-source LLMs (LLaMA-2-70B)
- Domain-specialized small model (DeepSeekMath)
Collaboration
- grouped display of all agents' answers
- convincing-sample demonstrations
- confidence-weighted voting
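A grouped display of answers for the next discussion prompt might look like the sketch below. The response dictionary layout, the output wording, and the `group_answers` helper are hypothetical; the source only states that agents see answers grouped together with explanations.

```python
from collections import defaultdict


def group_answers(responses: list[dict]) -> str:
    """Format agents' responses grouped by answer.

    Each response is assumed to look like:
    {"agent": str, "answer": str, "explanation": str, "confidence": float}
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for response in responses:
        groups[response["answer"]].append(response)

    lines = []
    # Largest group first; sorted() is stable, so ties keep first-seen order.
    for answer, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        lines.append(f"{len(members)} agent(s) answer: {answer}")
        for m in members:
            lines.append(
                f"  - {m['agent']} (confidence {m['confidence']:.2f}): "
                f"{m['explanation']}"
            )
    return "\n".join(lines)


summary = group_answers([
    {"agent": "model_a", "answer": "yes", "explanation": "Reason 1.", "confidence": 0.9},
    {"agent": "model_b", "answer": "no", "explanation": "Reason 2.", "confidence": 0.7},
    {"agent": "model_c", "answer": "no", "explanation": "Reason 3.", "confidence": 0.6},
])
```

The resulting text can be appended to each agent's prompt in the following round, so every agent sees how the team is split and why.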
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on black-box API models whose training data and internal calibration are unknown.
- Extra rounds increase latency and API cost.
- Confidence is elicited post-hoc via prompts and can be noisy without recalibration.
- Convincing samples help but require human explanations or curated corrective examples.
When Not To Use
- Real-time applications with tight latency budgets.
- Environments where API cost or call limits are prohibitive.
- Tasks already dominated by a single highly specialized model.
Failure Modes
- Echo chambers when the agent set lacks diversity (same-model agents repeat errors).
- An overconfident agent can dominate the weighted vote if calibration fails.
- Poor or misleading convincing examples can steer the group to a wrong consensus.
- Consensus may converge quickly to an incorrect majority opinion.
Core Entities
Models
- ChatGPT (gpt-3.5-turbo-0613)
- Bard (chat-bison-001)
- Claude2
- GPT-4
- LLaMA-2-70B
- DeepSeekMath
Metrics
- Accuracy
- BERTScore (for response diversity)
- Expected Calibration Error (ECE)
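Expected Calibration Error can be computed on a small labelled eval set to check whether elicited confidences are trustworthy. The sketch below uses the standard equal-width binning definition; the function name and the default of 10 bins are choices made here, not details from the source.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over [0, 1].

    ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty lists of equal length")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Two answers at 0.9 (both right) and two at 0.6 (one right).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

A lower value means stated confidence tracks observed accuracy more closely; comparing ECE before and after a recalibration step is one way to judge whether that step is worth keeping.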
Datasets
- StrategyQA
- CommonsenseQA
- GSM8K
- AQuA
- MATH
- Date Understanding (BIG-bench)
- ANLI
Benchmarks
- StrategyQA
- CommonsenseQA
- GSM8K
- AQuA
- MATH
- Date Understanding
- ANLI

