Corex: make multiple LLM agents Discuss, Review and Retrieve to improve complex reasoning

September 30, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

2

Authors

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, Lingpeng Kong

Links

Abstract / PDF

Why It Matters For Business

Corex can raise accuracy on complex reasoning tasks while using far fewer inference tokens than large-sample majority voting (on some tasks, about 5–10% of the token cost for equivalent performance). That lowers API bills and makes it practical to pair cheaper open-source models with stronger ones in cost-effective pipelines.

Summary TLDR

Corex turns several LLMs into a small team of autonomous agents that collaborate in three human-inspired modes: Discuss (group debate), Review (sequential peer review, including code review), and Retrieve (select the most faithful answer). With 5 agents, Corex matches or beats strong baselines across 18 reasoning tasks (math, symbolic, commonsense, semi-structured), often at lower token cost than large-sample majority-vote methods. Each mode has a distinct strength: Discuss helps commonsense tasks, Review fixes code and numerical errors, and Retrieve selects faithful reasoning chains.

Problem Statement

Single LLMs often fail on complex, multi-step reasoning because they rely only on their own internal representations and a single-pass output, so errors and hallucinations go uncaught and uncorrected. The paper asks whether small teams of LLMs can collaborate to produce more factual, faithful, and cost-effective answers.

Main Contribution

Corex: a practical suite of multi-model collaboration strategies (Discuss, Review, Retrieve) that treat LLMs as autonomous agents.

Design details and prompts for three modes: group discussions with a judge, sequential peer review (including code repair), and a retriever that scores faithfulness between chains and answers.

Extensive evaluation on 18 datasets across four categories (math, symbolic, commonsense, semi-structured) showing consistent gains over CoT, self-consistency, PAL and recent multi-agent baselines.

Analysis of mode-specific strengths, synergy when combining modes, the effect of different model backbones, and a cost-effectiveness study showing major token savings versus heavy majority-vote ensembles.

Released code and data to reproduce experiments: https://github.com/QiushiSun/Corex.

Key Findings

Retrieve mode with 5 agents improves average math accuracy over a strong self-consistency baseline.

Numbers: Math avg: Corex-Retrieve 86.3 vs CoT-SC(10) 84.6 (+1.7 pp)
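
To make the Retrieve idea concrete, here is a minimal Python sketch, assuming each agent returns a reasoning chain plus an answer and that a scorer rates how well the chain supports the answer. The names `Candidate`, `retrieve`, and `toy_scorer` are hypothetical and not taken from the Corex code; in a real setup the scorer would be a prompt to a retriever model rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    chain: str   # reasoning chain produced by one agent
    answer: str  # final answer produced by the same agent


def retrieve(candidates: List[Candidate],
             score_faithfulness: Callable[[str, str], float]) -> Candidate:
    """Return the candidate whose answer is best supported by its own chain."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score_faithfulness(c.chain, c.answer))


def toy_scorer(chain: str, answer: str) -> float:
    # Hypothetical stand-in for an LLM judgment: 1.0 if the answer string
    # literally appears in the chain, else 0.0.
    return 1.0 if answer in chain else 0.0


candidates = [
    Candidate(chain="3 + 4 = 7, then 7 * 2 = 14", answer="15"),
    Candidate(chain="3 + 4 = 7, then 7 * 2 = 14", answer="14"),
]
best = retrieve(candidates, toy_scorer)
print(best.answer)
```

The point of the design is that one scoring pass over a handful of candidates replaces drawing many more samples for a vote.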

Review mode, which checks and repairs generated code, yields clear gains on symbolic tasks.

Numbers: Symbolic avg: Corex-Review Code 91.1 vs PAL 88.3 (+2.8 pp)
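
A sequential review loop of this kind might look roughly like the sketch below. It assumes the generated program stores its result in a variable named `answer` and that each reviewer receives the current code plus any execution error. `Reviewer`, `run_code`, `review_chain`, and `toy_reviewer` are illustrative names, not the paper's API, and the toy reviewer stands in for an LLM call.

```python
from typing import Callable, List, Optional

# A reviewer takes (code, error_message_or_None) and returns revised code.
Reviewer = Callable[[str, Optional[str]], str]


def run_code(code: str) -> tuple:
    """Execute code that is expected to define `answer`; return (value, error)."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # NOTE: unsafe for untrusted code; sandbox in practice
        return namespace.get("answer"), None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"


def review_chain(code: str, reviewers: List[Reviewer]) -> tuple:
    """Pass the code through each reviewer in turn, then execute the result."""
    for reviewer in reviewers:
        _, error = run_code(code)
        code = reviewer(code, error)
    return run_code(code)


def toy_reviewer(code: str, error: Optional[str]) -> str:
    # Hypothetical reviewer: fixes one known typo if execution failed.
    if error and "NameError" in error:
        return code.replace("totl", "total")
    return code


draft = "total = 3 + 4\nanswer = totl * 2"
value, error = review_chain(draft, [toy_reviewer, toy_reviewer])
print(value, error)
```

Feeding the execution error to the next reviewer is one plausible way to give reviewers something concrete to repair; the actual prompts may pass different context.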

Discuss mode helps commonsense/factual tasks by improving rationale diversity and factuality.

Numbers: StrategyQA: Corex-Discuss 68.4 vs CoT 65.3 (+3.1 pp)
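
The group-discussion-with-judge pattern can be sketched as follows. This is a simplified illustration under assumptions: agents are plain callables that see only the previous round's answers, a group stops early on consensus, and a judge is consulted only when groups disagree. The exact grouping, stopping rule, and judge prompt in Corex may differ, and all names here are hypothetical.

```python
from collections import Counter
from typing import Callable, List

# An agent maps (question, peers' previous-round answers) to its new answer.
Agent = Callable[[str, List[str]], str]


def discuss(question: str, group: List[Agent], max_rounds: int = 3) -> str:
    """Run rounds until the group agrees or rounds run out; return the top answer."""
    answers: List[str] = []
    for _ in range(max_rounds):
        answers = [agent(question, answers) for agent in group]
        if len(set(answers)) == 1:  # consensus reached
            break
    return Counter(answers).most_common(1)[0][0]


def judge(question: str, conclusions: List[str],
          pick: Callable[[str, List[str]], str]) -> str:
    """Return the shared conclusion, or defer to the judge model on disagreement."""
    if len(set(conclusions)) == 1:
        return conclusions[0]
    return pick(question, conclusions)


# Toy agents (hypothetical): one never changes its mind, one follows the majority.
def stubborn(_q: str, _prev: List[str]) -> str:
    return "yes"


def follower(_q: str, prev: List[str]) -> str:
    return Counter(prev).most_common(1)[0][0] if prev else "no"


group_a = [stubborn, stubborn, follower]
group_b = [follower, stubborn, stubborn]
conclusions = [discuss("Is 17 prime?", g) for g in (group_a, group_b)]
final = judge("Is 17 prime?", conclusions, pick=lambda q, c: c[0])
print(conclusions, final)
```

The `stubborn` agent also hints at a known risk of this mode: a dominant participant can pull the whole group to its answer.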

Corex is more token-efficient than majority-vote ensembles and can match performance at much lower cost.

Numbers: Equivalent performance with ~5–10% of token cost vs majority-vote methods on AddSub (Fig. 11)
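
For reference, the majority-vote baseline and the relative-cost arithmetic can be sketched in a few lines. The token counts below are made-up illustrative values, not figures from the paper, and `majority_vote` and `relative_cost` are hypothetical helper names.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Self-consistency style aggregation: the most frequent sampled answer wins."""
    return Counter(answers).most_common(1)[0][0]


def relative_cost(method_tokens: int, baseline_tokens: int) -> float:
    """Token cost of a method as a fraction of the baseline's cost."""
    return method_tokens / baseline_tokens


# Illustrative numbers only (not from the paper).
sampled = ["12", "12", "15", "12", "9"]
vote_tokens = 40 * 500      # 40 sampled chains at ~500 tokens each
workflow_tokens = 2_000     # hypothetical small-team budget
print(majority_vote(sampled), f"{relative_cost(workflow_tokens, vote_tokens):.0%}")
```

Because the vote's cost grows linearly with the number of sampled chains, a fixed small-team workflow looks cheaper the larger the sample count it is compared against.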

Different modes complement each other; combining modes usually improves over single modes.

Numbers: Combining modes outperforms self-refine and single-mode runs (Figure 6)

Results

Accuracy (math average)
Value: 86.3
Baseline: CoT-SC(10) 84.6

Accuracy (symbolic average)
Value: 91.1
Baseline: PAL/PoT 88.3

Accuracy
Value: 77.6
Baseline: CoT-SC(10) 76.5

FinQA/ConvFinQA average
Value: 56.6
Baseline: CoT-SC(10) 54.9

Computational cost (tokens)
Value: ≈5–10% (to match certain baselines)
Baseline: majority-vote/self-consistency ensembles

Who Should Care

What To Try In 7 Days

Run a 5-agent Corex-Retrieve pipeline on a handful of your math-like QA examples to compare accuracy vs your current ensemble.

Add a lightweight Review stage (single reviewer) to any code-producing prompts to catch obvious bugs before execution.

Replace large-sample self-consistency runs with a small Corex workflow and measure token usage and error types for 100 queries.
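
One way to run these comparisons is a small harness that records accuracy and token usage per pipeline. The sketch below assumes each pipeline is a callable returning `(answer, tokens_used)`; that interface, the `Report` type, and the toy pipelines are hypothetical, and real pipelines would wrap your own model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A pipeline maps a question to (answer, tokens_used). Hypothetical interface.
Pipeline = Callable[[str], Tuple[str, int]]


@dataclass
class Report:
    accuracy: float
    total_tokens: int


def evaluate(pipeline: Pipeline, dataset: List[Tuple[str, str]]) -> Report:
    """Score a pipeline on (question, gold_answer) pairs and sum its token use."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = 0
    tokens = 0
    for question, gold_answer in dataset:
        answer, used = pipeline(question)
        correct += int(answer.strip() == gold_answer.strip())
        tokens += used
    return Report(accuracy=correct / len(dataset), total_tokens=tokens)


# Toy data and pipelines, for illustration only.
gold = {"2+2": "4", "3*3": "9", "10-7": "3", "6/2": "3"}
dataset = list(gold.items())


def big_ensemble(q: str) -> Tuple[str, int]:
    return gold[q], 5000  # always right, expensive


def small_team(q: str) -> Tuple[str, int]:
    return ("0" if q == "6/2" else gold[q]), 600  # one mistake, cheap


a = evaluate(big_ensemble, dataset)
b = evaluate(small_team, dataset)
print(a, b, b.total_tokens / a.total_tokens)
```

Logging the wrong answers alongside these totals makes it easier to see which error types each workflow introduces or removes.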

Agent Features

Memory

  • short-term round-limited memory (previous round only for GPT-3.5-Turbo experiments)
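
Round-limited memory of this kind can be approximated with a bounded queue. This is a sketch under the assumption that each round is a list of message strings; `RoundMemory` is a hypothetical helper, not part of the released code.

```python
from collections import deque
from typing import Deque, List


class RoundMemory:
    """Keep only the most recent `keep` rounds of messages."""

    def __init__(self, keep: int = 1) -> None:
        self._rounds: Deque[List[str]] = deque(maxlen=keep)

    def add_round(self, messages: List[str]) -> None:
        # Appending to a full deque silently drops the oldest round.
        self._rounds.append(list(messages))

    def context(self) -> str:
        """Flatten remembered rounds into text for the next prompt."""
        return "\n".join(msg for rnd in self._rounds for msg in rnd)


memory = RoundMemory(keep=1)
memory.add_round(["A: I think 12", "B: I think 15"])
memory.add_round(["A: still 12", "B: agreed, 12"])
print(memory.context())
```

With `keep=1` the prompt size stays bounded regardless of how many rounds are run, which is the trade-off a limited context window forces.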

Planning

  • iterative group discussions (Discuss)
  • sequential peer review (Review)

Tool Use

  • Python interpreter execution for generated code (PAL/ReviewCode)
  • model-to-model prompts for scoring (Retrieve)
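
Executing model-generated code is typically done outside the orchestrating process. The sketch below uses the standard library's `subprocess.run` with a timeout; `run_generated_code` is a hypothetical helper, and a separate process is only a basic safeguard, not a full sandbox.

```python
import subprocess
import sys


def run_generated_code(code: str, timeout_s: float = 5.0) -> tuple:
    """Run code in a separate Python process; return (stdout, error_or_None)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "", f"timed out after {timeout_s}s"
    if proc.returncode != 0:
        # Keep only the last traceback line, e.g. "ZeroDivisionError: ...".
        lines = proc.stderr.strip().splitlines()
        return proc.stdout, (lines[-1] if lines else "non-zero exit")
    return proc.stdout, None


out, err = run_generated_code("print(sum(range(1, 11)))")
print(out.strip(), err)
bad_out, bad_err = run_generated_code("print(1 / 0)")
print(bad_err)
```

Returning the short error string rather than the full traceback keeps the follow-up prompt to a reviewer compact.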

Frameworks

  • Corex orchestration scripts (GitHub)
  • OpenAI and Anthropic APIs

Is Agentic

true

Architectures

  • LLM-based agents (chat/completion models)

Collaboration

  • group discussion with judge
  • sequential review and repair
  • retriever scoring of chain-answer faithfulness

Optimization Features

Token Efficiency

  • reported ~5–10% token cost vs majority-vote on some tasks

System Optimization

  • mixing weaker open-source models with stronger reviewers to reduce cost

Inference Optimization

  • small agent teams (5 agents) instead of large sample voting
  • retriever selects faithful chains to avoid many costly samples

Reproducibility

Data Urls

  • public datasets referenced (GSM8K, BIG-bench, FinQA, ConvFinQA, etc.)

Code Available

true

Data Available

true

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments mostly with commercial APIs; open-source model collaborations explored at smaller scale.
  • Context-length limits constrain discussion depth (noted for GPT-3.5-Turbo; only previous round stored).
  • Instability can emerge when mixing models with very different capabilities; judge quality matters.
  • Results measured with specific LLMs and prompts; gains may vary with other models or production data.

When Not To Use

  • When you need a single low-latency model response with minimal orchestration overhead.
  • If you lack budget or API access to run multiple model calls in parallel.
  • When tasks are trivial and single-model CoT already saturates performance.

Failure Modes

  • Strong models may 'monopolize' discussions and drown out diverse insights.
  • Reviewer chains can oscillate and occasionally worsen answers across review rounds.
  • Generated code can still contain subtle bugs or misinterpretations even after reviews.
  • Retriever can favor confident but incorrect chains if the candidate pool lacks correct reasoning.

Core Entities

Models

  • GPT-3.5-Turbo-0613
  • GPT-3.5-Turbo-16k
  • GPT-4-0613
  • Claude-Instant-1.2
  • LLaMA-2-Chat(7B)
  • LLaMA-2-Chat(13B)

Metrics

  • Accuracy
  • official FinQA/ConvFinQA scripts

Datasets

  • GSM8K
  • GSM-Hard
  • SVAMP
  • MultiArith
  • SingleOP
  • SingleEQ
  • AddSub
  • CommonsenseQA
  • StrategyQA
  • OpenBookQA
  • BoolQ
  • ARC-c
  • FinQA
  • ConvFinQA
  • TAT-QA
  • BIG-bench (Penguin, Date, Colored Objects, Repeat Copy, Object Counting)

Benchmarks

  • BIG-bench
  • GSM-Hard