Corex: make multiple LLM agents Discuss, Review and Retrieve to improve complex reasoning

Overview

Decision Snapshot: Needs Validation

Corex shows clear practical gains on many reasoning benchmarks and reduces token cost vs large-sample ensembles. Engineering is needed to orchestrate agents, handle context limits, and select modes per task.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, Lingpeng Kong


Why It Matters For Business

Corex can boost accuracy on complex reasoning tasks while cutting inference token costs substantially; that reduces API bills and enables mixing cheaper open-source models with stronger ones for cost-effective pipelines.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

Corex turns multiple LLMs into a small team of autonomous agents that collaborate in three human-inspired ways: Discuss (group debate), Review (sequential peer review, including code), and Retrieve (pick the most faithful answer). Running 5 agents, Corex beats or matches strong baselines across 18 reasoning tasks (math, symbolic, commonsense, semi-structured), often at lower token cost than large-sample majority-vote methods. The modes show distinct strengths: Discuss helps commonsense reasoning, Review fixes code and numerical errors, and Retrieve selects faithful chains.

Problem Statement

Single LLMs often fail on multi-step, complex reasoning because their internal representations and single-pass outputs miss errors, hallucinate, or fail to self-correct. The paper asks: can small teams of LLMs collaborate to produce more factual, faithful, and cost-effective answers?

Main Contribution

Corex: a practical suite of multi-model collaboration strategies (Discuss, Review, Retrieve) that treat LLMs as autonomous agents.

Design details and prompts for three modes: group discussions with a judge, sequential peer review (including code repair), and a retriever that scores faithfulness between chains and answers.

Key Findings

Retrieve mode with 5 agents improves average math accuracy over a strong self-consistency baseline.

Numbers: Math avg: Corex-Retrieve 86.3 vs CoT-SC(10) 84.6 (+1.7 pp)

Practical Use: For arithmetic/math tasks, run a small retriever agent to pick the most faithful chain instead of generating many samples for a majority vote; you can get modest accuracy gains with fewer samples.

Evidence Ref: Table 1
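The Retrieve idea can be sketched as "score each (chain, answer) pair for faithfulness, then keep the top one". The snippet below is a minimal illustration, not the paper's implementation: `Candidate`, `retrieve`, and `toy_score` are hypothetical names, and `toy_score` is a trivial stand-in for what would be an LLM call returning a confidence score.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    chain: str   # reasoning chain produced by one agent
    answer: str  # final answer that agent derived from the chain


# Hypothetical scorer type; in a real pipeline this would wrap a model call.
Scorer = Callable[[Candidate], float]


def retrieve(candidates: Sequence[Candidate], score: Scorer) -> Candidate:
    """Return the candidate the scorer rates as most faithful."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(candidate: Candidate) -> float:
    # Assumption for illustration only: a chain counts as "faithful"
    # if the final answer literally appears in it.
    return 1.0 if candidate.answer in candidate.chain else 0.0


candidates = [
    Candidate("3 apples + 4 apples = 7 apples", "8"),
    Candidate("3 + 4 = 7, so the total is 7", "7"),
]
best = retrieve(candidates, toy_score)
print(best.answer)
```

The point of the design is that one scoring pass over a handful of candidates replaces a vote over many samples; the quality of the result depends entirely on how well the scorer judges chain-answer consistency.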

Review mode that checks and repairs generated code improves accuracy on symbolic tasks.

Numbers: Symbolic avg: Corex-Review Code 91.1 vs PAL 88.3 (+2.8 pp)

Practical Use: When problems involve programmatic steps or counting, add a sequential peer-review stage to catch bugs and misinterpretations before execution.

Evidence Ref: Table 3
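A sequential review stage for generated code might look roughly like the sketch below: a draft program passes through reviewers one after another, each returning a possibly repaired version, and only the final version is executed. All names here (`review_chain`, `run_solution`, the two toy reviewers) are hypothetical, and the reviewers are plain functions standing in for model calls.

```python
from typing import Callable, Iterable

# Hypothetical reviewer type: (question, code) -> possibly revised code.
Reviewer = Callable[[str, str], str]


def run_solution(code: str) -> object:
    """Execute generated code and return the value it binds to `answer`."""
    namespace: dict = {}
    # Assumption: the code is trusted here. Model-generated code should
    # only be executed inside a sandbox in a real system.
    exec(code, namespace)
    return namespace["answer"]


def review_chain(question: str, draft: str, reviewers: Iterable[Reviewer]) -> str:
    """Pass the draft through each reviewer in order and return the result."""
    code = draft
    for reviewer in reviewers:
        code = reviewer(question, code)
    return code


def fix_off_by_one(question: str, code: str) -> str:
    # Toy stand-in for a reviewer that notices the upper bound is exclusive.
    return code.replace("range(1, 10)", "range(1, 11)")


def approve(question: str, code: str) -> str:
    # Toy stand-in for a reviewer that finds nothing to change.
    return code


question = "What is the sum of the integers from 1 to 10?"
draft = "answer = sum(range(1, 10))"
final_code = review_chain(question, draft, [fix_off_by_one, approve])
result = run_solution(final_code)
print(result)
```

Executing only after the last review keeps the interpreter out of the loop; a variant could run the code between reviewers and feed errors back, at the cost of more calls.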

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	86.3	CoT-SC(10) 84.6	+1.7 pp	math benchmarks (Table 1 average)	Corex-Retrieve avg 86.3 vs CoT-SC(10) 84.6	Table 1
Accuracy	91.1	PAL/PoT 88.3	+2.8 pp	BigBench symbolic tasks (Table 3 average)	Corex-Review Code avg 91.1 vs PAL 88.3	Table 3

What To Try In 7 Days

Run a 5-agent Corex-Retrieve pipeline on a handful of your math-like QA examples to compare accuracy vs your current ensemble.

Add a lightweight Review stage (single reviewer) to any code-producing prompts to catch obvious bugs before execution.

Replace large-sample self-consistency runs with a small Corex workflow and measure token usage and error types for 100 queries.
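For the comparisons suggested above, a small harness that records accuracy and token usage side by side is usually enough. The sketch below assumes each pipeline is a callable returning `(answer, tokens_used)`; that interface, the `Report` type, and the two toy pipelines with their token counts are invented for illustration and are not measurements from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical pipeline interface: question -> (answer, tokens used).
Pipeline = Callable[[str], Tuple[str, int]]


@dataclass(frozen=True)
class Report:
    accuracy: float
    total_tokens: int


def evaluate(pipeline: Pipeline, dataset: Sequence[Tuple[str, str]]) -> Report:
    """Run a pipeline over (question, gold answer) pairs and total its cost."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = 0
    total_tokens = 0
    for question, gold in dataset:
        answer, used = pipeline(question)
        correct += int(answer.strip() == gold.strip())
        total_tokens += used
    return Report(correct / len(dataset), total_tokens)


# Toy data and pipelines; every number below is made up.
dataset = [("2 + 2", "4"), ("3 + 3", "6")]


def large_sample_vote(question: str) -> Tuple[str, int]:
    return {"2 + 2": "4", "3 + 3": "5"}[question], 400


def small_team(question: str) -> Tuple[str, int]:
    return {"2 + 2": "4", "3 + 3": "6"}[question], 100


baseline = evaluate(large_sample_vote, dataset)
candidate = evaluate(small_team, dataset)
print(baseline, candidate)
```

Exact string matching is the simplest possible metric; numeric tasks usually need normalization (units, rounding) before comparison, and logging the wrong answers alongside the totals makes the error-type review easier.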

Agent Features

Memory

short-term round-limited memory (previous round only for GPT-3.5-Turbo experiments)
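Round-limited memory of this kind can be approximated with a bounded queue that drops older rounds automatically. The `RoundMemory` class below is a hypothetical sketch of that idea, not code from the Corex repository; `max_rounds=1` mirrors the "previous round only" setting described above.

```python
from collections import deque
from typing import Deque, List


class RoundMemory:
    """Keep only the most recent discussion rounds for the next prompt."""

    def __init__(self, max_rounds: int = 1) -> None:
        # A deque with maxlen discards the oldest round when a new one arrives.
        self._rounds: Deque[List[str]] = deque(maxlen=max_rounds)

    def add_round(self, messages: List[str]) -> None:
        self._rounds.append(list(messages))

    def context(self) -> List[str]:
        """Flatten the retained rounds into one list of messages."""
        return [message for round_ in self._rounds for message in round_]


memory = RoundMemory(max_rounds=1)
memory.add_round(["agent A: 12", "agent B: 14"])
memory.add_round(["agent A: 14", "agent B: 14"])
print(memory.context())
```

Raising `max_rounds` trades longer prompts for more history, which is the context-length tension noted for the GPT-3.5-Turbo experiments.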

Planning

iterative group discussions (Discuss)

sequential peer review (Review)

Tool Use

Python interpreter execution for generated code (PAL/ReviewCode)

model-to-model prompts for scoring (Retrieve)

Frameworks

Corex orchestration scripts (GitHub)

OpenAI and Anthropic APIs

Is Agentic

Yes

Architectures

LLM-based agents (chat/completion models)

Collaboration

group discussion with judge

sequential review and repair

retriever scoring of chain-answer faithfulness
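As a rough illustration of the first collaboration pattern, the sketch below runs a few discussion rounds in which each agent sees only the previous round's answers, stops early on agreement, and otherwise hands the last round to a judge. This is a simplified assumption about the control flow, not the paper's exact protocol; `discuss`, the agent and judge signatures, and the toy agents are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical interfaces: both take the question and a list of answers.
Agent = Callable[[str, Sequence[str]], str]
Judge = Callable[[str, Sequence[str]], str]


def discuss(question: str, agents: Sequence[Agent], judge: Judge,
            max_rounds: int = 3) -> str:
    """Iterate rounds until agents agree; otherwise let the judge decide."""
    if max_rounds < 1 or not agents:
        raise ValueError("need at least one agent and one round")
    previous: List[str] = []
    for _ in range(max_rounds):
        current = [agent(question, previous) for agent in agents]
        if len(set(current)) == 1:  # unanimous, stop early
            return current[0]
        previous = current
    return judge(question, previous)


def says_a(question: str, previous: Sequence[str]) -> str:
    return "A"


def says_b(question: str, previous: Sequence[str]) -> str:
    return "B"


def follower(question: str, previous: Sequence[str]) -> str:
    # Starts with "B", then adopts the most common answer of the last round.
    return Counter(previous).most_common(1)[0][0] if previous else "B"


def pick_last(question: str, answers: Sequence[str]) -> str:
    return answers[-1]  # toy judge


consensus = discuss("toy question", [says_a, says_a, follower], pick_last)
judged = discuss("toy question", [says_a, says_b], pick_last, max_rounds=2)
print(consensus, judged)
```

In the first call the follower switches to "A" in round two, so the judge is never consulted; in the second call the agents never agree and the judge's choice is returned.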

Optimization Features

Token Efficiency

reported ~5–10% token cost vs majority-vote on some tasks

System Optimization

mixing weaker open-source models with stronger reviewers to reduce cost

Inference Optimization

small agent teams (5 agents) instead of large sample voting

retriever selects faithful chains to avoid many costly samples
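For reference, the large-sample voting baseline that these optimizations aim to replace reduces to counting final answers across many sampled chains. A minimal sketch, assuming the answers have already been extracted as strings (the `majority_vote` name and the sample list are illustrative):

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the sampled chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Toy final answers from five sampled chains.
sampled = ["7", "8", "7", "7", "9"]
print(majority_vote(sampled))
```

The vote itself is cheap; the cost lies in generating every sampled chain, which is why a small team plus a selection step can use fewer tokens.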

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/QiushiSun/Corex

Data URLs

public datasets referenced (GSM8K, BIG-bench, FinQA, ConvFinQA, etc.)

Risks & Boundaries

Limitations

Experiments mostly with commercial APIs; open-source model collaborations explored at smaller scale.

Context-length limits constrain discussion depth (noted for GPT-3.5-Turbo; only previous round stored).

When Not To Use

When you need a single low-latency model response with minimal orchestration overhead.

If you lack budget or API access to run multiple model calls in parallel.

Failure Modes

Strong models may 'monopolize' discussions and drown out diverse insights.

Reviewer chains can oscillate and occasionally worsen answers across review rounds.

Core Entities

Models

GPT-3.5-Turbo-0613, GPT-3.5-Turbo-16k, GPT-4-0613, Claude-Instant-1.2, LLaMA-2-Chat(7B), LLaMA-2-Chat(13B)

Metrics

Accuracy

official FinQA/ConvFinQA scripts

Datasets

GSM8K, GSM-Hard, SVAMP, MultiArith, SingleOP, SingleEQ, AddSub, CommonsenseQA, StrategyQA, OpenBookQA, BoolQ, ARC-c, FinQA, ConvFinQA, TAT-QA, BIG-bench (Penguin, Date, Colored Objects, Repeat Copy, Object Counting)

Benchmarks

BIG-bench, GSM-Hard

