Overview
Corex shows practical accuracy gains on many reasoning benchmarks and lowers token cost relative to large-sample ensembles. Deploying it takes engineering work: orchestrating agents, handling context limits, and selecting a mode per task.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Corex can boost accuracy on complex reasoning tasks while cutting inference token costs substantially; that reduces API bills and enables mixing cheaper open-source models with stronger ones for cost-effective pipelines.
Summary TLDR
Corex turns multiple LLMs into a small team of autonomous agents that collaborate in three human-inspired modes: Discuss (group debate), Review (sequential peer review, including code), and Retrieve (pick the most faithful answer). Running 5 agents, Corex beats or matches strong baselines across 18 reasoning tasks (math, symbolic, commonsense, semi-structured), often at lower token cost than large-sample majority-vote methods. The modes show distinct strengths: Discuss helps commonsense reasoning, Review fixes code and numerical errors, and Retrieve selects the most faithful reasoning chain.
Problem Statement
Single LLMs often fail on complex multi-step reasoning because they rely on internal representations and single-pass outputs, which can hallucinate, miss errors, or fail to self-correct. The paper asks: can small teams of LLMs collaborate to produce more factual, faithful, and cost-effective answers?
Main Contribution
Corex: a practical suite of multi-model collaboration strategies (Discuss, Review, Retrieve) that treat LLMs as autonomous agents.
Design details and prompts for three modes: group discussions with a judge, sequential peer review (including code repair), and a retriever that scores faithfulness between chains and answers.
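To make the Retrieve idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each agent returns a reasoning chain plus a final answer, and that some scoring function rates how well the chain supports the answer. The names `Candidate`, `retrieve`, and `toy_score` are hypothetical, and `toy_score` is a deliberately crude stand-in for the LLM-based retriever the summary describes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One agent's output: a reasoning chain and the answer it ends on."""
    chain: str
    answer: str


def majority_vote(candidates: Sequence[Candidate]) -> str:
    """Self-consistency style baseline: the most frequent answer wins."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


def retrieve(
    candidates: Sequence[Candidate],
    score_faithfulness: Callable[[Candidate], float],
) -> Candidate:
    """Return the candidate whose chain best supports its answer."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_faithfulness)


def toy_score(candidate: Candidate) -> float:
    # Placeholder assumption: 1.0 if the answer literally appears in the
    # chain, else 0.0. A real retriever would be another model call.
    return 1.0 if candidate.answer in candidate.chain else 0.0


candidates = [
    Candidate("2 + 2 = 4, so the answer is 4", "4"),
    Candidate("2 + 2 = 4", "5"),   # answer contradicts its own chain
    Candidate("a guess", "5"),     # no supporting reasoning
]

voted = majority_vote(candidates)
picked = retrieve(candidates, toy_score)
```

In this toy case the vote goes to the unsupported answer while the retriever keeps the one candidate whose chain backs its answer, which is the behaviour Retrieve is meant to capture.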
Key Findings
Retrieve mode with 5 agents improves average math accuracy over a strong self-consistency baseline (86.3 vs 84.6 for CoT-SC(10), Table 1).
Review mode that checks and repairs generated code improves average accuracy on BigBench symbolic tasks (91.1 vs 88.3 for PAL/PoT, Table 3).
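The code-repair flavour of Review can be pictured as a chain of reviewers, each receiving the current program and returning a possibly revised one. The sketch below is illustrative only: `Reviewer`, `review_chain`, `run_solution`, and the hard-coded `fix_off_by_one` reviewer are hypothetical stand-ins for model calls, and the convention that a solution stores its result in a variable named `answer` is an assumption made for this example.

```python
from typing import Callable, Sequence

# Assumption: a reviewer maps the current code to a (possibly revised) version.
Reviewer = Callable[[str], str]


def run_solution(code: str) -> object:
    """Execute candidate code and return its `answer` variable.

    exec() is used only for illustration; run untrusted model output
    in a sandbox, never in-process like this.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace["answer"]


def review_chain(code: str, reviewers: Sequence[Reviewer]) -> str:
    """Pass the code through each reviewer in order (sequential peer review)."""
    for reviewer in reviewers:
        code = reviewer(code)
    return code


def fix_off_by_one(code: str) -> str:
    # Toy reviewer: the task is "sum the integers 1..10", so the range
    # must end at 11. A real reviewer would be an LLM prompt.
    return code.replace("range(1, 10)", "range(1, 11)")


draft = "answer = sum(range(1, 10))"
reviewed = review_chain(draft, [fix_off_by_one])

before = run_solution(draft)      # 45: the draft drops the last term
after = run_solution(reviewed)    # 55: the intended sum of 1..10
```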
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 86.3 | CoT-SC(10) 84.6 | +1.7 pp | math benchmarks (Table 1 average) | Corex-Retrieve avg 86.3 vs CoT-SC(10) 84.6 | Table 1 |
| Accuracy | 91.1 | PAL/PoT 88.3 | +2.8 pp | BigBench symbolic tasks (Table 3 average) | Corex-Review Code avg 91.1 vs PAL 88.3 | Table 3 |
What To Try In 7 Days
Run a 5-agent Corex-Retrieve pipeline on a handful of your math-like QA examples to compare accuracy vs your current ensemble.
Add a lightweight Review stage (single reviewer) to any code-producing prompts to catch obvious bugs before execution.
Replace large-sample self-consistency runs with a small Corex workflow and measure token usage and error types for 100 queries.
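For the token-usage comparison in the last item, a small ledger is enough to get started. This is a sketch under stated assumptions: `TokenLedger` and `relative_saving` are hypothetical helpers, and `count_tokens` is a whitespace proxy that should be replaced with the token counts reported by your model provider.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Crude whitespace proxy; swap in your provider's real token counts."""
    return len(text.split())


@dataclass
class TokenLedger:
    """Accumulates call and token counts for one pipeline."""
    calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt: str, completion: str) -> None:
        self.calls += 1
        self.prompt_tokens += count_tokens(prompt)
        self.completion_tokens += count_tokens(completion)

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def relative_saving(baseline: TokenLedger, candidate: TokenLedger) -> float:
    """Fraction of baseline tokens saved by the candidate pipeline."""
    if baseline.total == 0:
        raise ValueError("baseline recorded no tokens")
    return 1 - candidate.total / baseline.total


# Toy illustration with made-up strings, not measured data.
ensemble, workflow = TokenLedger(), TokenLedger()
for _ in range(10):                      # e.g. a 10-sample ensemble
    ensemble.record("solve this problem", "the answer")
for _ in range(5):                       # e.g. a 5-agent workflow
    workflow.record("solve this problem", "the answer")

saving = relative_saving(ensemble, workflow)
```

Logging the error type of each wrong answer next to the ledger entry makes it easier to see whether any saving comes at the expense of a particular class of mistakes.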
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Experiments mostly use commercial APIs; collaborations among open-source models are explored only at a smaller scale.
Context-length limits constrain discussion depth (noted for GPT-3.5-Turbo, where only the previous round is kept in context).
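One simple way to live with the context-length limit is a rolling buffer that drops older discussion rounds. The sketch below is an assumption about how such a buffer could look, not the paper's code; `RoundBuffer` and its `keep_rounds` parameter are hypothetical names.

```python
from collections import deque


class RoundBuffer:
    """Keeps only the most recent discussion rounds for the next prompt."""

    def __init__(self, keep_rounds: int = 1) -> None:
        # deque(maxlen=...) silently discards the oldest round when full.
        self._rounds: deque = deque(maxlen=keep_rounds)

    def add_round(self, messages: list) -> None:
        self._rounds.append(list(messages))

    def context(self) -> str:
        """Flatten the retained rounds into text for the next prompt."""
        return "\n".join(msg for rnd in self._rounds for msg in rnd)


buffer = RoundBuffer(keep_rounds=1)
buffer.add_round(["A: I get 12", "B: I get 14"])
buffer.add_round(["A: agreed, 14", "B: 14"])
prompt_context = buffer.context()
```

With `keep_rounds=1` the second round replaces the first, so agents never see arguments from earlier rounds. Raising the value trades prompt length for discussion depth.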
When Not To Use
When you need a single low-latency model response with minimal orchestration overhead.
If you lack budget or API access to run multiple model calls in parallel.
Failure Modes
Strong models may 'monopolize' discussions and drown out diverse insights.
Reviewer chains can oscillate and occasionally worsen answers across review rounds.
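A cheap guard against the oscillation failure mode is to stop as soon as a review round reproduces an answer seen earlier. This is a minimal sketch assuming answers can be compared as strings; `review_until_stable` is a hypothetical helper, and it only detects cycles. It does not tell you whether a revision made the answer worse.

```python
from typing import Callable, Sequence, Tuple


def review_until_stable(
    answer: str,
    reviewers: Sequence[Callable[[str], str]],
    max_rounds: int = 5,
) -> Tuple[str, str]:
    """Run review rounds until the answer converges, cycles, or rounds run out.

    Returns (final_answer, status), where status is one of
    "converged", "oscillating", or "max_rounds".
    """
    seen = [answer]
    for _ in range(max_rounds):
        for reviewer in reviewers:
            answer = reviewer(answer)
        if answer == seen[-1]:
            return answer, "converged"    # a full round changed nothing
        if answer in seen:
            return answer, "oscillating"  # back to an earlier answer
        seen.append(answer)
    return answer, "max_rounds"


def flip(answer: str) -> str:
    # Toy reviewer that keeps changing its mind between two answers.
    return "B" if answer == "A" else "A"


def keep(answer: str) -> str:
    # Toy reviewer that accepts the answer as-is.
    return answer


cycling = review_until_stable("A", [flip])
stable = review_until_stable("A", [keep])
```

When the status is "oscillating", a pipeline could fall back to the pre-review answer or escalate to a different mode rather than trusting the last revision.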

