Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
Corex can boost accuracy on complex reasoning tasks while cutting inference token costs substantially; that reduces API bills and enables mixing cheaper open-source models with stronger ones for cost-effective pipelines.
Summary TLDR
Corex turns multiple LLMs into a small team of autonomous agents that collaborate in three human-inspired ways: Discuss (group debate), Review (sequential peer review, including code), and Retrieve (pick the most faithful answer). Running 5 agents, Corex beats or matches strong baselines across 18 reasoning tasks (math, symbolic, commonsense, semi-structured), often at lower token cost than large-sample majority-vote methods. The modes show distinct strengths: Discuss helps commonsense, Review fixes code/numerical errors, Retrieve selects faithful chains.
Problem Statement
Single LLMs often fail on multi-step, complex reasoning: a single-pass output can miss errors, hallucinate, or go uncorrected because the model does not reliably self-correct. The paper asks: can small teams of LLMs collaborate to produce more factual, faithful, and cost-effective answers?
Main Contribution
Corex: a practical suite of multi-model collaboration strategies (Discuss, Review, Retrieve) that treat LLMs as autonomous agents.
Design details and prompts for three modes: group discussions with a judge, sequential peer review (including code repair), and a retriever that scores faithfulness between chains and answers.
Extensive evaluation on 18 datasets across four categories (math, symbolic, commonsense, semi-structured) showing consistent gains over CoT, self-consistency, PAL and recent multi-agent baselines.
Analysis of mode-specific strengths, synergy when combining modes, the effect of different model backbones, and a cost-effectiveness study showing major token savings versus heavy majority-vote ensembles.
Released code and data to reproduce experiments: https://github.com/QiushiSun/Corex.
Key Findings
Retrieve mode with 5 agents improves average math accuracy over a strong self-consistency baseline.
Review mode that checks and repairs generated code yields big gains on symbolic tasks.
Discuss mode helps commonsense/factual tasks by improving rationale diversity and factuality.
Corex is more token-efficient than majority-vote ensembles and can match performance at much lower cost.
Different modes complement each other; combining modes usually improves over single modes.
Results
- Accuracy
- FinQA/ConvFinQA average
- computational cost (tokens)
Who Should Care
What To Try In 7 Days
Run a 5-agent Corex-Retrieve pipeline on a handful of your math-like QA examples to compare accuracy vs your current ensemble.
Add a lightweight Review stage (single reviewer) to any code-producing prompts to catch obvious bugs before execution.
Replace large-sample self-consistency runs with a small Corex workflow and measure token usage and error types for 100 queries.
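To make the comparison in the last step concrete, a minimal bookkeeping sketch follows. It assumes you log, per query and per pipeline, whether the answer was correct and how many prompt and completion tokens were spent; the `RunLog` fields and the sample numbers are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Per-query record for one pipeline; field names are illustrative."""
    correct: bool
    prompt_tokens: int
    completion_tokens: int


def summarize(logs: list[RunLog]) -> dict[str, float]:
    """Return accuracy and mean total tokens per query for one pipeline."""
    n = len(logs)
    if n == 0:
        raise ValueError("no logs to summarize")
    total_tokens = sum(log.prompt_tokens + log.completion_tokens for log in logs)
    return {
        "accuracy": sum(log.correct for log in logs) / n,
        "tokens_per_query": total_tokens / n,
    }


def token_ratio(candidate: list[RunLog], baseline: list[RunLog]) -> float:
    """Mean tokens per query of `candidate` as a fraction of `baseline`."""
    return summarize(candidate)["tokens_per_query"] / summarize(baseline)["tokens_per_query"]


# Made-up numbers purely to exercise the functions; not results from the paper.
baseline = [RunLog(True, 4000, 6000), RunLog(False, 4000, 6000)]
corex = [RunLog(True, 1500, 1000), RunLog(True, 1500, 1000)]

print(summarize(baseline))
print(summarize(corex))
print(f"token ratio: {token_ratio(corex, baseline):.2f}")
```

Keeping the per-query records, not just the totals, also lets you inspect which error types each pipeline makes.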
Agent Features
Memory
- short-term round-limited memory (previous round only for GPT-3.5-Turbo experiments)
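One way to picture round-limited memory is a fixed-size buffer of discussion rounds. The sketch below is an illustration under that assumption; the `RoundMemory` class and its methods are hypothetical and not taken from the Corex code.

```python
from collections import deque


class RoundMemory:
    """Keeps only the most recent `max_rounds` rounds of agent messages."""

    def __init__(self, max_rounds: int = 1) -> None:
        # deque(maxlen=...) silently drops the oldest round when full.
        self._rounds = deque(maxlen=max_rounds)

    def add_round(self, messages: dict[str, str]) -> None:
        # `messages` maps agent name -> what that agent said this round.
        self._rounds.append(dict(messages))

    def context(self) -> str:
        """Render the retained rounds as text to prepend to the next prompt."""
        lines = []
        for messages in self._rounds:
            for agent, text in messages.items():
                lines.append(f"{agent}: {text}")
        return "\n".join(lines)


memory = RoundMemory(max_rounds=1)
memory.add_round({"agent_a": "I think the answer is 7."})
memory.add_round({"agent_a": "On reflection, 12.", "agent_b": "Agreed, 12."})
print(memory.context())  # only the second round is still present
```

With `max_rounds=1` the prompt size stays bounded regardless of how long the discussion runs, at the price of forgetting earlier arguments.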
Planning
- iterative group discussions (Discuss)
- sequential peer review (Review)
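Sequential peer review can be sketched as a draft passed through an ordered chain of reviewers, each returning a possibly revised version. The `Reviewer` signature and the toy reviewers below are assumptions for illustration; in practice each reviewer would be an LLM call.

```python
from typing import Callable, Sequence

# Hypothetical reviewer signature: (question, current_draft) -> revised_draft.
Reviewer = Callable[[str, str], str]


def sequential_review(question: str, draft: str, reviewers: Sequence[Reviewer]) -> list[str]:
    """Pass a draft through reviewers in order; return every version, oldest first."""
    history = [draft]
    for review in reviewers:
        history.append(review(question, history[-1]))
    return history


def fix_arithmetic(question: str, draft: str) -> str:
    # Toy stand-in for a reviewer that spots a wrong operator.
    return draft.replace("3 + 4 = 7", "3 * 4 = 12")


def approve(question: str, draft: str) -> str:
    # Toy stand-in for a reviewer that finds nothing to change.
    return draft


history = sequential_review(
    "3 boxes of 4 pens each. How many pens?",
    "3 + 4 = 7",
    [fix_arithmetic, approve],
)
print(history[-1])
```

Returning the full history rather than only the final draft makes it possible to notice when successive reviewers flip an answer back and forth.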
Tool Use
- Python interpreter execution for generated code (PAL/ReviewCode)
- model-to-model prompts for scoring (Retrieve)
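The interpreter-execution item can be illustrated with a small run-and-repair loop: execute the generated program, and if it raises, hand the code and error message to a reviewer for a fix. This is a sketch under stated assumptions: the `solution()` convention, the `CodeReviewer` signature, and the toy reviewer are hypothetical, and `exec` is not a sandbox.

```python
from typing import Callable

# Hypothetical reviewer signature: (code, error_message) -> repaired code.
CodeReviewer = Callable[[str, str], str]


def run_solution(code: str) -> object:
    """Execute code that must define solution(), then return solution()."""
    namespace: dict = {}
    exec(code, namespace)  # NOT a sandbox; use only on trusted or isolated input
    return namespace["solution"]()


def review_and_run(code: str, reviewer: CodeReviewer, max_repairs: int = 2):
    """Run the code; on an exception, ask the reviewer for a repair and retry.

    Returns (result, number_of_repairs_used). Re-raises the last error
    once `max_repairs` repairs have been tried.
    """
    for attempt in range(max_repairs + 1):
        try:
            return run_solution(code), attempt
        except Exception as exc:
            if attempt == max_repairs:
                raise
            code = reviewer(code, f"{type(exc).__name__}: {exc}")


buggy = "def solution():\n    return total * 4\n"


def toy_reviewer(code: str, error: str) -> str:
    # Stand-in for an LLM reviewer: defines the variable the error complains about.
    if "NameError" in error:
        return "total = 3\n" + code
    return code


result, repairs = review_and_run(buggy, toy_reviewer)
print(result, repairs)
```

This loop only reacts to runtime errors; catching code that runs but computes the wrong quantity would require a reviewer that also sees the question and inspects the logic.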
Frameworks
- Corex orchestration scripts (GitHub)
- OpenAI and Anthropic APIs
Is Agentic
true
Architectures
- LLM-based agents (chat/completion models)
Collaboration
- group discussion with judge
- sequential review and repair
- retriever scoring of chain-answer faithfulness
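The first pattern above, group discussion with a judge, can be sketched as a loop in which every agent answers after seeing the previous round, stopping early on consensus and otherwise deferring to a judge. The `Agent` and `Judge` signatures, the stopping rule, and the toy agents are assumptions for illustration, not the paper's prompts or protocol details.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical signatures:
Agent = Callable[[str, Sequence[str]], str]  # (question, previous round's answers) -> answer
Judge = Callable[[str, Sequence[str]], str]  # (question, last round's answers) -> final answer


def discuss(question: str, agents: Sequence[Agent], judge: Judge, max_rounds: int = 3) -> str:
    """Run up to `max_rounds` rounds; return early if all agents agree."""
    previous: list[str] = []
    for _ in range(max_rounds):
        current = [agent(question, previous) for agent in agents]
        previous = current
        if len(set(current)) == 1:
            return current[0]
    # No consensus: the judge decides from the last round's answers.
    return judge(question, previous)


def always(answer: str) -> Agent:
    # Toy agent that never changes its mind.
    return lambda question, previous: answer


def follower(question: str, previous: Sequence[str]) -> str:
    # Toy agent: starts with "7", then adopts the previous round's most common answer.
    if not previous:
        return "7"
    return Counter(previous).most_common(1)[0][0]


def majority_judge(question: str, answers: Sequence[str]) -> str:
    # Toy judge; in practice this would be a (preferably strong) model.
    return Counter(answers).most_common(1)[0][0]


final = discuss("How many pens?", [always("12"), always("12"), follower], majority_judge)
print(final)
```

Because agents see only the previous round here, the sketch is compatible with a round-limited memory.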
Optimization Features
Token Efficiency
- reported token cost of roughly 5–10% of a majority-vote ensemble's on some tasks
System Optimization
- mixing weaker open-source models with stronger reviewers to reduce cost
Inference Optimization
- small agent teams (5 agents) instead of large sample voting
- retriever selects faithful chains to avoid many costly samples
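A minimal sketch of the retriever idea: each agent contributes one reasoning chain and answer, a scorer rates how faithfully each answer follows from its chain, and the top-scoring candidate is returned. The `Candidate` type, the `Scorer` signature, and the string-matching `toy_scorer` are hypothetical stand-ins for a model-to-model scoring prompt.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    chain: str   # reasoning chain produced by one agent
    answer: str  # final answer that agent reported


# Hypothetical scorer signature: (question, candidate) -> faithfulness score.
Scorer = Callable[[str, Candidate], float]


def retrieve(question: str, candidates: Sequence[Candidate], scorer: Scorer) -> Candidate:
    """Return the candidate with the highest faithfulness score (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda cand: scorer(question, cand))


def toy_scorer(question: str, cand: Candidate) -> float:
    # Stand-in for an LLM scorer: 1.0 if the chain ends with the reported answer.
    return 1.0 if cand.chain.strip().endswith(cand.answer) else 0.0


candidates = [
    Candidate(chain="3 boxes of 4 pens: 3 + 4 = 7", answer="12"),
    Candidate(chain="3 boxes of 4 pens: 3 * 4 = 12", answer="12"),
]
best = retrieve("How many pens are there?", candidates, toy_scorer)
print(best.chain)
```

The cost is one scoring call per candidate, so a 5-agent pool needs far fewer generations than a large-sample vote, though the selection can only be as good as the best chain in the pool.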
Reproducibility
Code Urls
- https://github.com/QiushiSun/Corex
Data Urls
- public datasets referenced (GSM8K, BIG-bench, FinQA, ConvFinQA, etc.)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments mostly with commercial APIs; open-source model collaborations explored at smaller scale.
- Context-length limits constrain discussion depth (noted for GPT-3.5-Turbo; only previous round stored).
- Instability can emerge when mixing models with very different capabilities; judge quality matters.
- Results measured with specific LLMs and prompts; gains may vary with other models or production data.
When Not To Use
- When you need a single low-latency model response with minimal orchestration overhead.
- If you lack budget or API access to run multiple model calls in parallel.
- When tasks are trivial and single-model CoT already saturates performance.
Failure Modes
- Strong models may 'monopolize' discussions and drown out diverse insights.
- Reviewer chains can oscillate and occasionally worsen answers across review rounds.
- Generated code can still contain subtle bugs or misinterpretations even after reviews.
- Retriever can favor confident but incorrect chains if the candidate pool lacks correct reasoning.
Core Entities
Models
- GPT-3.5-Turbo-0613
- GPT-3.5-Turbo-16k
- GPT-4-0613
- Claude-Instant-1.2
- LLaMA-2-Chat(7B)
- LLaMA-2-Chat(13B)
Metrics
- Accuracy
- official FinQA/ConvFinQA scripts
Datasets
- GSM8K
- GSM-Hard
- SVAMP
- MultiArith
- SingleOP
- SingleEQ
- AddSub
- CommonsenseQA
- StrategyQA
- OpenBookQA
- BoolQ
- ARC-c
- FinQA
- ConvFinQA
- TAT-QA
- BIG-bench (Penguin, Date, Colored Objects, Repeat Copy, Object Counting)
Benchmarks
- BIG-bench
- GSM-Hard

