PRISM: jointly optimize Exploration, Information, and Aggregation for cheaper, more reliable multi-agent LLM reasoning

February 9, 2026 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives formal proofs for the decomposition and convergence in deterministic settings and supports claims with multi-benchmark experiments and ablations, but some guarantees weaken for non-executable tasks and the study uses one model family.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 70%

Authors

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

Why It Matters For Business

PRISM shows you can get higher accuracy and much better token-cost efficiency by coordinating small models with execution grounding and evidence-based synthesis, rather than only buying bigger models.

Summary TLDR

The paper introduces PRISM, a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) guided by a formal decomposition of multi-agent gains into Exploration (diverse candidates), Information (high-fidelity feedback), and Aggregation (evidence-based synthesis). PRISM uses role-based proposers, execution or pseudo-verification feedback, and iterative closed-loop synthesis. Across GSM8K, AIME-2025, MBPP, and BFCL-SP it outperforms existing multi-agent baselines and shows better compute-efficiency on token-cost vs accuracy Pareto fronts. The paper provides proofs that execution feedback is information-theoretically optimal for deterministic tasks and convergence guarantees for its

Problem Statement

Multi-agent LLM methods help reasoning but are heuristic and fragmented. Practitioners lack a principled view of why teams of LLMs help, which parts to optimize, and how to allocate compute. This paper provides a unified decomposition of multi-agent gains and a practical system (PRISM) that jointly optimizes the three identified dimensions to get sustained accuracy scaling and better compute-efficiency.

Main Contribution

A formal three-way decomposition of multi-agent gains: Exploration, Information, Aggregation.

PRISM: a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) that jointly optimizes all three dimensions.
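The four phases can be sketched as a plain Python loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Candidate` type, the `prism_round` function, the four callables passed into it, and the stopping rule are all hypothetical stand-ins. Only the phase names and their mapping to Exploration, Information, and Aggregation come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    role: str
    answer: str
    passed: bool = False  # execution feedback (hypothetical field)
    review: str = ""      # reviewer notes (hypothetical field)


def prism_round(
    task: str,
    roles: List[str],
    propose: Callable[[str, str], str],
    execute: Callable[[str], bool],
    review: Callable[[str, str], str],
    synthesize: Callable[[str, List[Candidate]], str],
    max_iters: int = 3,
) -> str:
    # Propose: one candidate per role (Exploration).
    candidates = [Candidate(role, propose(role, task)) for role in roles]
    answer = ""
    for _ in range(max_iters):
        # Execute: ground each candidate in a run or validation result (Information).
        for c in candidates:
            c.passed = execute(c.answer)
        # Review: attach reviewer notes to each candidate.
        for c in candidates:
            c.review = review(task, c.answer)
        # Synthesize: merge candidates using the collected evidence (Aggregation).
        answer = synthesize(task, candidates)
        # Closed loop (assumed stopping rule): re-execute and stop once it passes.
        if execute(answer):
            return answer
        candidates.append(Candidate("synthesizer", answer))
    return answer
```

In a real pilot, `propose`, `review`, and `synthesize` would wrap LLM calls and `execute` would wrap a sandbox or validator; passing them in as callables keeps the loop testable with stubs.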

Key Findings

PRISM outperforms all tested multi-agent baselines on four benchmarks.

Numbers: GSM8K +1.3pp; AIME +6.6pp; MBPP +6.6pp; BFCL-SP +3.5pp (vs runner-up)

Practical Use: Use PRISM-style joint optimization to gain several percentage points over current multi-agent methods on math, code, and function-calling tasks.

Evidence Ref: Main results, Table 2

Execution feedback yields maximal selection information for deterministic tasks.

Numbers: I(Q;e) = H(Q) for executable tasks (Theorem 3.5a)

Practical Use: When you can run or validate outputs (tests, schema checks), prioritize execution grounding; it gives the best signal for choosing correct solutions.

Evidence Ref: Theorem 3.5a, Remark 3.2
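The identity I(Q;e) = H(Q) is what I(Q;e) = H(Q) − H(Q|e) reduces to when the feedback e fully determines Q, so that H(Q|e) = 0. The toy check below illustrates this numerically. It assumes an interpretation the summary does not spell out: Q is a binary correctness label, and e is a feedback signal that matches Q except when flipped with probability `noise`. All function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def entropy(dist: Dict[int, float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)


def mutual_information(joint: Dict[Tuple[int, int], float]) -> float:
    """I(Q;e) in bits from a joint distribution over (q, e) pairs."""
    marg_q: Dict[int, float] = {}
    marg_e: Dict[int, float] = {}
    for (q, e), v in joint.items():
        marg_q[q] = marg_q.get(q, 0.0) + v
        marg_e[e] = marg_e.get(e, 0.0) + v
    return sum(
        v * math.log2(v / (marg_q[q] * marg_e[e]))
        for (q, e), v in joint.items()
        if v > 0
    )


def feedback_joint(p_correct: float, noise: float) -> Dict[Tuple[int, int], float]:
    """Joint of (Q, e): Q=1 with prob p_correct; e equals Q unless flipped with prob noise."""
    prior = {1: p_correct, 0: 1.0 - p_correct}
    joint: Dict[Tuple[int, int], float] = {}
    for q, prob in prior.items():
        joint[(q, q)] = prob * (1.0 - noise)
        joint[(q, 1 - q)] = prob * noise
    return joint
```

With `noise=0.0` (deterministic execution) the mutual information equals the entropy of Q; any `noise` strictly between 0 and 1 gives strictly less, which is the gap a pseudo-verifier leaves compared with real execution in this toy model.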

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 91.1% | Runner-up multi-agent, 89.8% | +1.3pp | GSM8K test split | Table 2 main results | Table 2
Accuracy | 93.3% | Runner-up multi-agent, 86.7% | +6.6pp | AIME-2025 (30 problems) | Table 2 main results | Table 2

What To Try In 7 Days

Run a small PRISM pilot: 3 role-based proposers (Minimalist, Skeptic, Explorer), 1 reviewer, 3 synthesis iterations on an executable task.

If you have test suites or validators, add execution grounding to selection instead of relying on self-critique.

Measure token cost vs accuracy and plot a simple Pareto curve to see if joint optimization beats scaling a single model.
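For the last step, a Pareto front over (token cost, accuracy) measurements can be computed in a few lines. This is a minimal sketch: the `pareto_front` name and the `(token_cost, accuracy)` tuple layout are assumptions for illustration, and the input must come from your own runs.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (token_cost, accuracy)


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points no other point beats on both axes.

    Lower token cost and higher accuracy are treated as better.
    """
    front: List[Point] = []
    best_acc = float("-inf")
    # Sort by cost ascending; on equal cost, visit the higher accuracy first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        # A point is kept only if it improves on every cheaper point's accuracy.
        if acc > best_acc:
            front.append((cost, acc))
            best_acc = acc
    return front
```

Computing one front for the single-model configurations and one for the multi-agent configurations, then comparing them at matched token budgets, shows whether joint optimization actually pays off on your task.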

Agent Features

Memory
closed-loop short-term validation (re-execute & refine)
Planning
iterative synthesis with closed-loop validation
Tool Use
sandboxed code execution; function calling/schema validation; LLM-based pseudo-verifier when execution not available
Frameworks
PRISM
Is Agentic

Yes

Architectures
role-based proposers + reviewer(s) + synthesizer pipeline; iterative closed-loop synthesis modeled as a potential game
Collaboration
evidence-based cross-evaluation (reviews); role diversity to induce complementary behavior

Optimization Features

Token Efficiency
Accuracy; parallel propose/review phases to reduce wall-clock
Infra Optimization
use of parallel calls to lower wall-clock; token cost used as proxy
System Optimization
fully parallel proposer and reviewer calls; closed-loop re-execution to avoid brute-force stacking
Inference Optimization
test-time compute scaling via multi-agent orchestration; temperature=0 deterministic synthesis for logical rigor

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

GSM8K, AIME-2025, MBPP, BFCL-SP

Risks & Boundaries

Limitations

Theoretical guarantees rely on deterministic execution or reviewer independence; these assumptions weaken for open-ended tasks without verifiers.

Experiments use one base model family, so gains may differ with heterogeneous ensembles.

When Not To Use

Open-ended dialogue or creative text generation where no execution or reliable verifier exists (G_info ≈ 0).

When reviewer/verifier accuracy is too low and adding reviewers is expensive compared to getting ground-truth signals.

Failure Modes

Voting amplifies correlated errors when correct answers fragment or p_i < 0.5.

Verifier or reviewer biases can misclassify evidence, harming synthesis.
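The voting failure mode can be illustrated with the standard binomial majority-vote calculation. The sketch assumes a binary question and n independent voters that are each correct with the same probability p. Both are simplifying assumptions, and `majority_accuracy` is a hypothetical helper name. Even without correlation between voters, majority voting lowers accuracy once p drops below 0.5.

```python
from math import comb


def majority_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))
```

For p above 0.5 the result exceeds p and grows with n; for p below 0.5 it falls below p and shrinks with n, so adding voters makes the ensemble worse than a single agent.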

Core Entities

Models

Qwen3-30B-A3B-Instruct, Qwen3-235B-A22B-Instruct, DeepSeek-V3.2

Metrics

Accuracy, Pass@1, Token Cost

Datasets

GSM8K, AIME-2025, MBPP, BFCL-SP

Benchmarks

GSM8K, AIME-2025, MBPP, BFCL-SP

Context Entities

Models

Qwen3-30B-A3B-Instruct-2507 (used as base in experiments)

Metrics

Accuracy