Corex: make multiple LLM agents Discuss, Review and Retrieve to improve complex reasoning

September 30, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Corex reports accuracy gains on many reasoning benchmarks (for example, +1.7 pp on math and +2.8 pp on symbolic tasks) and lower token cost than large-sample ensembles. Adopting it takes engineering work: orchestrating agents, handling context limits, and selecting a mode per task.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, Lingpeng Kong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Corex can boost accuracy on complex reasoning tasks while cutting inference token costs substantially; that reduces API bills and enables mixing cheaper open-source models with stronger ones for cost-effective pipelines.

Who Should Care

Summary TLDR

Corex turns multiple LLMs into a small team of autonomous agents that collaborate in three human-inspired ways: Discuss (group debate), Review (sequential peer review, including code), and Retrieve (pick the most faithful answer). Running 5 agents, Corex beats or matches strong baselines across 18 reasoning tasks (math, symbolic, commonsense, semi-structured), often at lower token cost than large-sample majority-vote methods. The modes show distinct strengths: Discuss helps commonsense reasoning, Review fixes code and numerical errors, and Retrieve selects faithful reasoning chains.

Problem Statement

Single LLMs often fail on multi-step, complex reasoning because they rely on their own internal representations and a single-pass output, so they miss errors, hallucinate, or fail to self-correct. The paper asks: can small teams of LLMs collaborate to produce more factual, faithful, and cost-effective answers?

Main Contribution

Corex: a practical suite of multi-model collaboration strategies (Discuss, Review, Retrieve) that treat LLMs as autonomous agents.

Design details and prompts for three modes: group discussions with a judge, sequential peer review (including code repair), and a retriever that scores faithfulness between chains and answers.
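
As a rough illustration of the Discuss mode described above, the sketch below runs a round-limited group discussion and falls back to a judge when the agents still disagree. The `Agent` type, the prompt wording, the consensus rule, and `max_rounds` are assumptions made for illustration; they are not the paper's prompts or its exact protocol.

```python
from typing import Callable, List

# Hypothetical agent interface: takes a prompt, returns an answer string.
Agent = Callable[[str], str]


def discuss(question: str, agents: List[Agent], judge: Agent,
            max_rounds: int = 3) -> str:
    """Run a group discussion; a judge settles any remaining disagreement."""
    previous: List[str] = []
    for _ in range(max_rounds):
        # Each agent sees the question plus the previous round's answers only.
        context = "\n".join(f"Peer answer: {a}" for a in previous)
        prompt = f"{question}\n{context}".strip()
        current = [agent(prompt) for agent in agents]
        if len(set(current)) == 1:
            return current[0]  # consensus reached, stop early
        previous = current
    # No consensus within the round budget: ask the judge to choose.
    options = "\n".join(sorted(set(previous)))
    return judge(f"{question}\nCandidate answers:\n{options}")
```

In practice each `Agent` would wrap a model call; here any callable works, which keeps the control flow testable without API access.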

Key Findings

Retrieve mode with 5 agents improves average math accuracy over a strong self-consistency baseline.

Numbers: Math avg: Corex-Retrieve 86.3 vs CoT-SC(10) 84.6 (+1.7 pp)

Practical Use: For arithmetic/math tasks, run a small retriever agent to pick the most faithful chain instead of generating many samples for majority vote; you can get modest accuracy gains with fewer samples.

Evidence Ref: Table 1
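
The selection step in Retrieve can be sketched as scoring each (chain, answer) pair for faithfulness and keeping the top-scoring candidate. The `Candidate` class and the `Scorer` signature are hypothetical; in a real setup the scorer would be a model-to-model prompt rather than a local function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    chain: str   # reasoning chain produced by one agent
    answer: str  # final answer extracted from that chain


# Hypothetical scorer: higher means the answer follows more faithfully
# from the chain. A real scorer would prompt a retriever model.
Scorer = Callable[[str, Candidate], float]


def retrieve(question: str, candidates: List[Candidate],
             score: Scorer) -> Candidate:
    """Return the candidate with the highest faithfulness score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(score(question, c), c) for c in candidates]
    _, best = max(scored, key=lambda pair: pair[0])
    return best
```

Unlike majority voting, this picks one chain on the strength of its reasoning, so a correct minority answer can still win.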

Review mode that checks and repairs generated code improves average accuracy on symbolic tasks by 2.8 pp over PAL.

Numbers: Symbolic avg: Corex-Review Code 91.1 vs PAL 88.3 (+2.8 pp)

Practical Use: When problems involve programmatic steps or counting, add a sequential peer-review stage to catch bugs and misinterpretations before execution.

Evidence Ref: Table 3
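
A minimal sketch of a sequential code-review stage, assuming each reviewer is a callable that receives the question and the current code and returns a (possibly repaired) version. The `Reviewer` type and the convention that generated code defines a `solution()` function are assumptions for this example, not details confirmed by the source.

```python
from typing import Callable, List

# Hypothetical reviewer interface: (question, code) -> revised code.
Reviewer = Callable[[str, str], str]


def review_chain(question: str, draft: str,
                 reviewers: List[Reviewer]) -> str:
    """Pass the draft through each reviewer in order; each sees the latest code."""
    code = draft
    for reviewer in reviewers:
        code = reviewer(question, code)
    return code


def run_solution(code: str) -> object:
    """Execute code that defines solution() and return its result."""
    namespace: dict = {}
    # Caution: exec runs arbitrary code. Sandbox model output in real use.
    exec(code, namespace)
    return namespace["solution"]()
```

For instance, a draft that counts the integers from 1 to 10 with `len(range(1, 10))` returns 9; a reviewer that rewrites the bound to `range(1, 11)` yields the intended 10 before the result is ever used.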

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 86.3 | CoT-SC(10) 84.6 | +1.7 pp | math benchmarks (Table 1 average) | Corex-Retrieve avg 86.3 vs CoT-SC(10) 84.6 | Table 1
Accuracy | 91.1 | PAL/PoT 88.3 | +2.8 pp | BIG-bench symbolic tasks (Table 3 average) | Corex-Review Code avg 91.1 vs PAL 88.3 | Table 3

What To Try In 7 Days

Run a 5-agent Corex-Retrieve pipeline on a handful of your math-like QA examples to compare accuracy vs your current ensemble.

Add a lightweight Review stage (single reviewer) to any code-producing prompts to catch obvious bugs before execution.

Replace large-sample self-consistency runs with a small Corex workflow and measure token usage and error types for 100 queries.
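
One way to run these comparisons is a small harness that reports accuracy and average token usage for any pipeline. The `Pipeline` signature (question in, answer plus token count out) is an assumption for this sketch; `majority_vote` is included as the self-consistency-style baseline to compare against.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical pipeline interface: question -> (answer, tokens used).
Pipeline = Callable[[str], Tuple[str, int]]


def majority_vote(answers: List[str]) -> str:
    """Baseline aggregation: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def evaluate(pipeline: Pipeline,
             dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Report exact-match accuracy and average tokens per query."""
    if not dataset:
        raise ValueError("dataset is empty")
    correct = 0
    tokens = 0
    for question, gold in dataset:
        answer, used = pipeline(question)
        correct += int(answer.strip() == gold.strip())
        tokens += used
    n = len(dataset)
    return {"accuracy": correct / n, "avg_tokens": tokens / n}
```

Running `evaluate` once per pipeline on the same queries gives a like-for-like view of the accuracy and cost trade-off; logging the wrong answers alongside would cover the error-type analysis.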

Agent Features

Memory
short-term round-limited memory (previous round only for GPT-3.5-Turbo experiments)
Planning
iterative group discussions (Discuss); sequential peer review (Review)
Tool Use
Python interpreter execution for generated code (PAL/ReviewCode); model-to-model prompts for scoring (Retrieve)
Frameworks
Corex orchestration scripts (GitHub); OpenAI and Anthropic APIs
Is Agentic

Yes

Architectures
LLM-based agents (chat/completion models)
Collaboration
group discussion with judge; sequential review and repair; retriever scoring of chain-answer faithfulness
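
The round-limited memory listed above can be approximated with a bounded buffer that drops older rounds automatically. The class name `RoundMemory` and its methods are hypothetical; only the "keep the previous round" behavior comes from the summary.

```python
from collections import deque
from typing import Deque, Iterable, List


class RoundMemory:
    """Keep only the most recent discussion rounds to fit a context limit."""

    def __init__(self, keep_rounds: int = 1) -> None:
        # A deque with maxlen discards the oldest round when a new one arrives.
        self._rounds: Deque[List[str]] = deque(maxlen=keep_rounds)

    def add_round(self, messages: Iterable[str]) -> None:
        self._rounds.append(list(messages))

    def context(self) -> str:
        """Join the retained messages into a prompt-ready string."""
        return "\n".join(m for rnd in self._rounds for m in rnd)
```

With the default `keep_rounds=1`, adding a second round evicts the first, which mirrors storing only the previous round; a longer-context model could raise the limit.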

Optimization Features

Token Efficiency
reported token cost of roughly 5–10% of that of majority-vote baselines on some tasks
System Optimization
mixing weaker open-source models with stronger reviewers to reduce cost
Inference Optimization
small agent teams (5 agents) instead of large sample voting; retriever selects faithful chains to avoid many costly samples
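
To check token-efficiency claims like these on your own workload, a simple approach is to log every model call and compare totals. The `Call` record and both helper functions are hypothetical names for this sketch; the token counts would come from whatever usage data your API client reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Call:
    model: str              # which model served the call
    prompt_tokens: int      # tokens sent
    completion_tokens: int  # tokens generated


def total_tokens(calls: List[Call]) -> int:
    """Sum prompt and completion tokens over all logged calls."""
    return sum(c.prompt_tokens + c.completion_tokens for c in calls)


def token_ratio(workflow: List[Call], baseline: List[Call]) -> float:
    """Workflow token usage as a fraction of the baseline's usage."""
    base = total_tokens(baseline)
    if base == 0:
        raise ValueError("baseline used no tokens")
    return total_tokens(workflow) / base
```

A ratio of 0.1 would mean the workflow used 10% of the baseline's tokens. Keeping the `model` field makes it possible to price cheaper and stronger models separately when mixing them.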

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

public datasets referenced (GSM8K, BIG-bench, FinQA, ConvFinQA, etc.)

Risks & Boundaries

Limitations

Experiments mostly with commercial APIs; open-source model collaborations explored at smaller scale.

Context-length limits constrain discussion depth (noted for GPT-3.5-Turbo; only previous round stored).

When Not To Use

When you need a single low-latency model response with minimal orchestration overhead.

If you lack budget or API access to run multiple model calls in parallel.

Failure Modes

Strong models may 'monopolize' discussions and drown out diverse insights.

Reviewer chains can oscillate and occasionally worsen answers across review rounds.
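
One possible guard against the oscillation failure mode is to record the answer after each review round and flag the first round that reverts to a previously abandoned answer. This check is an illustrative assumption, not a mechanism described in the source, and `first_oscillation` is a hypothetical helper.

```python
from typing import List, Optional


def first_oscillation(answers: List[str]) -> Optional[int]:
    """Return the index of the first round that reverts to an earlier,
    previously abandoned answer, or None if no such round exists."""
    seen = set()
    for i, answer in enumerate(answers):
        changed = i > 0 and answer != answers[i - 1]
        if changed and answer in seen:
            return i
        seen.add(answer)
    return None
```

For the per-round answers `["4", "5", "4"]` the function returns 2, since round 2 goes back to an answer that round 1 had replaced. A pipeline could stop reviewing at that point or escalate to a stronger reviewer.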

Core Entities

Models

GPT-3.5-Turbo-0613, GPT-3.5-Turbo-16k, GPT-4-0613, Claude-Instant-1.2, LLaMA-2-Chat(7B), LLaMA-2-Chat(13B)

Metrics

Accuracy, official FinQA/ConvFinQA scripts

Datasets

GSM8K, GSM-Hard, SVAMP, MultiArith, SingleOP, SingleEQ, AddSub, CommonsenseQA, StrategyQA, OpenBookQA, BoolQ, ARC-c, FinQA, ConvFinQA, TAT-QA, BIG-bench (Penguin, Date, Colored Objects, Repeat Copy, Object Counting)

Benchmarks

BIG-bench, GSM-Hard