PRISM: jointly optimize Exploration, Information, and Aggregation for cheaper, more reliable multi-agent LLM reasoning

February 9, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

2

Authors

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

Links

Abstract / PDF

Why It Matters For Business

PRISM shows you can get higher accuracy and much better token-cost efficiency by coordinating small models with execution grounding and evidence-based synthesis, rather than only buying bigger models.

Summary TLDR

The paper introduces PRISM, a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) guided by a formal decomposition of multi-agent gains into Exploration (diverse candidates), Information (high-fidelity feedback), and Aggregation (evidence-based synthesis). PRISM uses role-based proposers, execution or pseudo-verification feedback, and iterative closed-loop synthesis. Across GSM8K, AIME-2025, MBPP, and BFCL-SP it outperforms existing multi-agent baselines and shows better compute efficiency on token-cost vs accuracy Pareto fronts. The paper provides proofs that execution feedback is information-theoretically optimal for deterministic tasks, along with convergence guarantees for its synthesis loop.

Problem Statement

Multi-agent LLM methods help reasoning but are heuristic and fragmented. Practitioners lack a principled view of why teams of LLMs help, which parts to optimize, and how to allocate compute. This paper provides a unified decomposition of multi-agent gains and a practical system (PRISM) that jointly optimizes the three identified dimensions to get sustained accuracy scaling and better compute-efficiency.

Main Contribution

A formal three-way decomposition of multi-agent gains: Exploration, Information, Aggregation.

PRISM: a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) that jointly optimizes all three dimensions.

Theoretical results: execution feedback is information-theoretically optimal for deterministic tasks and the synthesis loop has convergence guarantees.

Extensive empirical evaluation showing improved accuracy and token-efficiency across math, code, and function-calling benchmarks.
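The four phases can be pictured with a toy sketch. This is not the authors' implementation: the LLM calls are replaced by hard-coded functions, and the task, the test pairs, and every name below (`Candidate`, `TESTS`, `run_prism`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical task: "return the square of x", with (input, expected) test pairs.
TESTS: List[Tuple[int, int]] = [(0, 0), (2, 4), (-3, 9)]


@dataclass
class Candidate:
    role: str
    solve: Callable[[int], int]  # toy stand-in for LLM-generated code
    passed: int = 0


def propose() -> List[Candidate]:
    """Phase 1: one candidate per role (hard-coded here instead of LLM calls)."""
    return [
        Candidate("Skeptic", lambda x: abs(x) * 2),   # wrong on -3
        Candidate("Minimalist", lambda x: x * x),     # correct
        Candidate("Explorer", lambda x: x * abs(x)),  # wrong on -3
    ]


def execute(c: Candidate) -> Candidate:
    """Phase 2: ground the candidate in execution results."""
    c.passed = sum(1 for x, want in TESTS if c.solve(x) == want)
    return c


def review(cands: List[Candidate]) -> List[Candidate]:
    """Phase 3: rank by evidence (tests passed) rather than by vote."""
    return sorted(cands, key=lambda c: c.passed, reverse=True)


def synthesize(ranked: List[Candidate], max_iters: int = 3) -> Candidate:
    """Phase 4: re-execute candidates in ranked order; stop once one validates."""
    best = ranked[0]
    for cand in ranked[:max_iters]:
        cand = execute(cand)
        if cand.passed > best.passed:
            best = cand
        if best.passed == len(TESTS):
            break
    return best


def run_prism() -> Candidate:
    return synthesize(review([execute(c) for c in propose()]))


winner = run_prism()
print(winner.role, winner.passed)
```

In a real system the synthesizer would rewrite or merge candidates using the review evidence; the loop here only re-validates existing ones, which is enough to show where execution feedback enters the pipeline.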

Key Findings

PRISM outperforms all tested multi-agent baselines on four benchmarks.

Numbers: GSM8K +1.3pp; AIME +6.6pp; MBPP +6.6pp; BFCL-SP +3.5pp (vs runner-up)

Execution feedback yields maximal selection information for deterministic tasks.

Numbers: I(Q;e) = H(Q) for executable tasks (Theorem 3.5a)
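One way to read this identity, as a sketch rather than the paper's proof: assume Q denotes the correctness of a candidate and e the execution outcome, and assume that for a deterministic, executable task the outcome fully determines correctness, i.e. Q = f(e) for some function f. These symbol readings are assumptions; the summary does not define them.

```latex
\begin{align}
I(Q; e) &= H(Q) - H(Q \mid e) \\
H(Q \mid e) &= 0 \quad \text{because } Q = f(e) \\
\Rightarrow \quad I(Q; e) &= H(Q)
\end{align}
```

Since I(Q;Z) ≤ H(Q) holds for any feedback signal Z, a signal that reaches H(Q) cannot be improved on for selection, which is the sense in which execution feedback is maximal.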

Role-based diversity reduces pairwise success correlation.

Numbers: measured pairwise correlations on MBPP ≈ -0.12 to -0.18 (avg ≈ -0.15)
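A pairwise success correlation of this kind can be computed from per-problem pass/fail records. The sketch below assumes the metric is the Pearson correlation between binary success vectors, which the summary does not state explicitly, and the outcome vectors are made up for illustration.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List


def pearson(a: List[int], b: List[int]) -> float:
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


# Hypothetical per-problem outcomes (1 = solved, 0 = failed) for three roles.
success: Dict[str, List[int]] = {
    "Minimalist": [1, 1, 0, 1, 0, 1, 1, 0],
    "Skeptic": [0, 1, 1, 0, 1, 1, 0, 1],
    "Explorer": [1, 0, 1, 1, 1, 0, 1, 1],
}

pairwise = {
    (r1, r2): pearson(success[r1], success[r2])
    for r1, r2 in combinations(success, 2)
}
for pair, rho in pairwise.items():
    print(pair, round(rho, 3))
```

Negative values mean the roles tend to fail on different problems, which is what makes their candidates complementary.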

PRISM achieves strong token-efficiency on MBPP.

Numbers: 84.6% at ~1.54M tokens vs MoA 84.2% at ~7.73M tokens (~5× fewer tokens)

Joint optimization is subadditive (the combined gain falls short of the linear upper bound) but still outperforms single-dimension methods.

Numbers: PRISM gain +8.6pp vs theoretical linear upper bound +9.8pp (synergy γ = 0.88) on MBPP
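The two ratios quoted in these findings can be checked with a few lines of arithmetic. The sketch assumes γ is the realized gain divided by the linear upper bound; that reading is consistent with the reported numbers but is not spelled out in this summary.

```python
# Token counts in millions, as reported for MBPP.
prism_tokens_m = 1.54
moa_tokens_m = 7.73
token_ratio = moa_tokens_m / prism_tokens_m  # how many times more tokens MoA uses

# Gains in percentage points, as reported for MBPP.
prism_gain_pp = 8.6
linear_bound_pp = 9.8
gamma = prism_gain_pp / linear_bound_pp  # assumed definition of the synergy ratio

print(f"MoA uses {token_ratio:.2f}x the tokens of PRISM")
print(f"synergy gamma = {gamma:.2f}")
```

A γ below 1 is what "subadditive" means here: the three dimensions overlap somewhat, so their combined gain is smaller than the sum of the parts.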

Results

Accuracy (GSM8K)

Value: 91.1%

Baseline: runner-up multi-agent 89.8%

Accuracy (AIME-2025)

Value: 93.3%

Baseline: runner-up multi-agent 86.7%

Accuracy (MBPP)

Value: 84.6%

Baseline: runner-up multi-agent 78.0%

Accuracy (BFCL-SP)

Value: 92.3%

Baseline: runner-up multi-agent 88.8%

Token-efficiency (MBPP)

Value: 84.6% at ~1.54M tokens

Baseline: MoA 84.2% at ~7.73M tokens

Who Should Care

What To Try In 7 Days

Run a small PRISM pilot: 3 role-based proposers (Minimalist, Skeptic, Explorer), 1 reviewer, 3 synthesis iterations on an executable task.

If you have test suites or validators, add execution grounding to selection instead of relying on self-critique.

Measure token cost vs accuracy and plot a simple Pareto curve to see if joint optimization beats scaling a single model.
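For the last step, a Pareto front over (token cost, accuracy) points takes only a few lines. In this sketch the PRISM and MoA points are the MBPP figures reported above, while the third point is a made-up placeholder for a single-model run, not a measured result.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, tokens in millions, accuracy in %)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep points that no other point beats on both cost and accuracy."""
    front = []
    for name, tok, acc in points:
        dominated = any(
            t2 <= tok and a2 >= acc and (t2 < tok or a2 > acc)
            for _, t2, a2 in points
        )
        if not dominated:
            front.append((name, tok, acc))
    return sorted(front, key=lambda p: p[1])  # cheapest first


runs: List[Point] = [
    ("PRISM", 1.54, 84.6),                # reported on MBPP
    ("MoA", 7.73, 84.2),                  # reported on MBPP
    ("hypothetical-single", 0.20, 70.0),  # placeholder, not a real measurement
]

for name, tok, acc in pareto_front(runs):
    print(f"{name}: {acc}% at {tok}M tokens")
```

MoA drops off the front because PRISM is both cheaper and more accurate; the placeholder stays only because it is the cheapest option. Replace the placeholder with your own single-model baseline before drawing conclusions.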

Agent Features

Memory

  • closed-loop short-term validation (re-execute & refine)

Planning

  • iterative synthesis with closed-loop validation

Tool Use

  • sandboxed code execution
  • function calling/schema validation
  • LLM-based pseudo-verifier when execution not available
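A minimal sketch of execution-grounded selection with a pseudo-verifier fallback follows. The function names and the scoring interface are assumptions, and a child process with a timeout is only a stand-in for a real sandbox: it does not isolate the filesystem or the network.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def run_with_tests(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if candidate code plus its asserts exits cleanly in a child process."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", candidate_src + "\n" + test_src],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def select(
    candidates: List[str],
    test_src: Optional[str] = None,
    pseudo_verifier: Optional[Callable[[str], float]] = None,
) -> Optional[str]:
    """Prefer execution evidence; fall back to a scoring function if no tests exist."""
    if test_src is not None:
        for src in candidates:
            if run_with_tests(src, test_src):
                return src
        return None  # nothing passed; a synthesizer would refine and retry
    if pseudo_verifier is not None and candidates:
        # Hypothetical fallback: an LLM-based judge would supply this score.
        return max(candidates, key=pseudo_verifier)
    return None


candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
chosen = select(candidates, test_src="assert add(2, 3) == 5")
print(chosen)
```

The fallback path is deliberately a plain callable so that any judge, LLM-based or otherwise, can be plugged in without changing the selection logic.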

Frameworks

  • PRISM

Is Agentic

true

Architectures

  • role-based proposers + reviewer(s) + synthesizer pipeline
  • iterative closed-loop synthesis modeled as a potential game

Collaboration

  • evidence-based cross-evaluation (reviews)
  • role diversity to induce complementary behavior

Optimization Features

Token Efficiency

  • Accuracy
  • parallel propose/review phases to reduce wall-clock

Infra Optimization

  • parallel calls to lower wall-clock time; token cost used as a proxy for compute

System Optimization

  • fully parallel proposer and reviewer calls
  • closed-loop re-execution to avoid brute-force stacking
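Parallel proposer calls can be sketched with `asyncio.gather`. The `call_agent` coroutine below is a hypothetical stand-in for an LLM client call, with a short sleep in place of network latency.

```python
import asyncio
from typing import List


async def call_agent(role: str, prompt: str) -> str:
    """Hypothetical LLM call; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return f"[{role}] draft for: {prompt}"


async def propose_parallel(roles: List[str], prompt: str) -> List[str]:
    """Issue all proposer calls concurrently; results keep the order of `roles`."""
    return await asyncio.gather(*(call_agent(r, prompt) for r in roles))


drafts = asyncio.run(
    propose_parallel(["Minimalist", "Skeptic", "Explorer"], "sum a list")
)
for d in drafts:
    print(d)
```

Concurrency lowers wall-clock time only; the token count, and therefore the token cost, is the same as for sequential calls.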

Inference Optimization

  • test-time compute scaling via multi-agent orchestration
  • temperature=0 deterministic synthesis for logical rigor

Reproducibility

Data Urls

  • GSM8K
  • AIME-2025
  • MBPP
  • BFCL-SP

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Theoretical guarantees rely on deterministic execution or reviewer independence; these assumptions weaken for open-ended tasks without verifiers.
  • Experiments use one base model family, so gains may differ with heterogeneous ensembles.
  • AIME-2025 results are limited by small sample size (N=30) and wide confidence intervals.
  • Token cost is used as a proxy for compute; real infra costs and latency effects need separate analysis.

When Not To Use

  • Open-ended dialogue or creative text generation where no execution or reliable verifier exists (G_info ≈ 0).
  • When reviewer/verifier accuracy is too low and adding reviewers is expensive compared to getting ground-truth signals.

Failure Modes

  • Voting amplifies correlated errors when correct answers fragment or p_i < 0.5.
  • Verifier or reviewer biases can misclassify evidence, harming synthesis.
  • Synthesis agent itself can introduce new errors; closed-loop validation mitigates but may not eliminate this.
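The p_i < 0.5 condition in the first failure mode can be illustrated with a simplified model: n independent voters, a binary answer, and the same accuracy p for every voter. These simplifications are assumptions of the sketch; correlated errors and fragmented answers, which the paper also flags, are not modeled and would make voting look worse.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """P(strict majority of n independent voters is right), each right with prob p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for p in (0.4, 0.5, 0.6):
    print(f"p={p}: single={p:.3f}, majority of 5={majority_correct(p, 5):.3f}")
```

Above p = 0.5 the majority is more reliable than a single agent; below it, voting pushes accuracy further down, which is why evidence-based selection matters when individual agents are weak.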

Core Entities

Models

  • Qwen3-30B-A3B-Instruct
  • Qwen3-235B-A22B-Instruct
  • DeepSeek-V3.2

Metrics

  • Accuracy
  • Pass@1
  • Token Cost

Datasets

  • GSM8K
  • AIME-2025
  • MBPP
  • BFCL-SP

Benchmarks

  • GSM8K
  • AIME-2025
  • MBPP
  • BFCL-SP

Context Entities

Models

  • Qwen3-30B-A3B-Instruct-2507 (used as base in experiments)

Metrics

  • Accuracy