PRISM: jointly optimize Exploration, Information, and Aggregation for cheaper, more reliable multi-agent LLM reasoning

Overview

Decision Snapshot: Ready For Pilot

The paper gives formal proofs for the decomposition and convergence in deterministic settings and supports claims with multi-benchmark experiments and ablations, but some guarantees weaken for non-executable tasks and the study uses one model family.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 70%

Authors

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

Links

Abstract / PDF / Data

Why It Matters For Business

PRISM shows you can get higher accuracy and much better token-cost efficiency by coordinating small models with execution grounding and evidence-based synthesis, rather than only buying bigger models.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper introduces PRISM, a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) guided by a formal decomposition of multi-agent gains into Exploration (diverse candidates), Information (high-fidelity feedback), and Aggregation (evidence-based synthesis). PRISM uses role-based proposers, execution or pseudo-verification feedback, and iterative closed-loop synthesis. Across GSM8K, AIME-2025, MBPP, and BFCL-SP it outperforms existing multi-agent baselines and shows better compute-efficiency on token-cost vs accuracy Pareto fronts. The paper provides proofs that execution feedback is information-theoretically optimal for deterministic tasks and convergence guarantees for its

Problem Statement

Multi-agent LLM methods help reasoning but are heuristic and fragmented. Practitioners lack a principled view of why teams of LLMs help, which parts to optimize, and how to allocate compute. This paper provides a unified decomposition of multi-agent gains and a practical system (PRISM) that jointly optimizes the three identified dimensions to get sustained accuracy scaling and better compute-efficiency.

Main Contribution

A formal three-way decomposition of multi-agent gains: Exploration, Information, Aggregation.

PRISM: a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) that jointly optimizes all three dimensions.
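To make the four phases concrete, here is a minimal Python sketch of one way a Propose, Execute, Review, Synthesize loop could be wired together. It is an illustration under stated assumptions, not the paper's implementation: `Candidate`, `prism_loop`, and the `toy_*` functions are hypothetical names, and model calls are replaced by plain Python callables on a toy addition task.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Candidate:
    role: str                       # which proposer produced this answer
    answer: str
    evidence: str = ""              # execution (or pseudo-verifier) feedback
    reviews: List[str] = field(default_factory=list)

def prism_loop(task, roles, propose, execute, review, synthesize, max_iters=3):
    """Hypothetical driver: all five callables are stand-ins for model/tool calls."""
    # Propose: one candidate per role to diversify the pool (Exploration).
    pool = [Candidate(role, propose(role, task)) for role in roles]
    answer = ""
    for _ in range(max_iters):
        # Execute: ground every candidate in feedback (Information).
        for c in pool:
            c.evidence = execute(task, c.answer)
        # Review: evaluate each candidate against its evidence.
        for c in pool:
            c.reviews = [review(task, c)]
        # Synthesize: pick or merge using evidence and reviews (Aggregation).
        answer = synthesize(task, pool)
        # Closed loop: re-execute the synthesized answer and stop once it passes.
        if execute(task, answer) == "pass":
            break
        pool.append(Candidate("synthesizer", answer))
    return answer

# Toy stand-ins (assumptions for illustration): the task is adding two integers.
def toy_propose(role, task):
    offsets = {"Minimalist": 0, "Skeptic": 1, "Explorer": -1}
    return str(sum(task) + offsets[role])

def toy_execute(task, answer):
    return "pass" if answer == str(sum(task)) else "fail"

def toy_review(task, candidate):
    return f"{candidate.role}: {candidate.evidence}"

def toy_synthesize(task, pool):
    passing = [c for c in pool if c.evidence == "pass"]
    return (passing or pool)[0].answer

result = prism_loop((2, 3), ["Minimalist", "Skeptic", "Explorer"],
                    toy_propose, toy_execute, toy_review, toy_synthesize)
```

In a real pilot the proposer and reviewer calls would run in parallel, and `execute` would be a sandboxed test run or schema validator rather than a string comparison.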

Key Findings

PRISM outperforms all tested multi-agent baselines on four benchmarks.

Numbers: GSM8K +1.3pp; AIME +6.6pp; MBPP +6.6pp; BFCL-SP +3.5pp (vs runner-up)

Practical Use: Use PRISM-style joint optimization to gain several percentage points over current multi-agent methods on math, code, and function-calling tasks.

Evidence Ref: Main results, Table 2

Execution feedback yields maximal selection information for deterministic tasks.

Numbers: I(Q;e) = H(Q) for executable tasks (Theorem 3.5a)

Practical Use: When you can run or validate outputs (tests, schema checks), prioritize execution grounding; it gives the best signal for choosing correct solutions.

Evidence Ref: Theorem 3.5a, Remark 3.2
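As a rough illustration of execution grounding, the sketch below selects among candidate solutions by running them against test cases instead of asking a model to critique them. The helper names `pass_rate` and `select_by_execution` are hypothetical, the candidates are plain Python callables standing in for generated code, and a real deployment would run untrusted code in a sandbox.

```python
from typing import Callable, Sequence, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)

def pass_rate(fn: Callable, tests: Sequence[TestCase]) -> float:
    """Fraction of test cases the candidate passes; exceptions count as failures."""
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

def select_by_execution(candidates: Sequence[Callable],
                        tests: Sequence[TestCase]) -> Tuple[int, float]:
    """Return (index, pass rate) of the best candidate; ties go to the lowest index."""
    scored = [(pass_rate(fn, tests), i) for i, fn in enumerate(candidates)]
    best_rate, best_idx = max(scored, key=lambda s: (s[0], -s[1]))
    return best_idx, best_rate

# Toy example (made-up candidates): only the second one implements addition.
candidates = [lambda a, b: a - b, lambda a, b: a + b, lambda a, b: a * b]
tests = [((1, 2), 3), ((2, 2), 4), ((0, 5), 5)]
best_idx, best_rate = select_by_execution(candidates, tests)
```

Because the selection signal comes from actual outcomes, a candidate that merely sounds plausible gets no credit, which is the intuition behind preferring execution feedback wherever outputs are checkable.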

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	91.1%	Runner-up multi-agent 89.8%	+1.3pp	GSM8K test split	Table 2 main results	Table 2
Accuracy	93.3%	Runner-up multi-agent 86.7%	+6.6pp	AIME-2025 (30 problems)	Table 2 main results	Table 2

What To Try In 7 Days

Run a small PRISM pilot: 3 role-based proposers (Minimalist, Skeptic, Explorer), 1 reviewer, 3 synthesis iterations on an executable task.

If you have test suites or validators, add execution grounding to selection instead of relying on self-critique.

Measure token cost vs accuracy and plot a simple Pareto curve to see if joint optimization beats scaling a single model.
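For the last step, a Pareto front over (token cost, accuracy) pairs can be computed in a few lines. This is a minimal sketch: `Run` and `pareto_front` are hypothetical names, and the numbers are made-up placeholders, not results from the paper.

```python
from typing import List, NamedTuple

class Run(NamedTuple):
    name: str
    tokens: float     # total tokens spent per problem (lower is better)
    accuracy: float   # fraction correct (higher is better)

def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep runs that no cheaper-or-equal run matches or beats on accuracy."""
    front = []
    best_acc = float("-inf")
    # Sort by cost; among equal costs, put the most accurate run first.
    for run in sorted(runs, key=lambda r: (r.tokens, -r.accuracy)):
        if run.accuracy > best_acc:
            front.append(run)
            best_acc = run.accuracy
    return front

# Placeholder measurements for illustration only.
runs = [
    Run("single-small", 1000, 0.70),
    Run("single-large", 8000, 0.82),
    Run("vote-5", 5000, 0.75),
    Run("pipeline-a", 4000, 0.85),
    Run("pipeline-b", 9000, 0.84),
]
front_names = [r.name for r in pareto_front(runs)]
```

If the multi-agent configurations land on the front and the larger single model does not, that is the pattern the paper reports; if not, the pilot suggests scaling the single model is the better buy for your task.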

Agent Features

Memory

closed-loop short-term validation (re-execute & refine)

Planning

iterative synthesis with closed-loop validation

Tool Use

sandboxed code execution

function calling/schema validation

LLM-based pseudo-verifier when execution is not available

Frameworks

PRISM

Is Agentic

Yes

Architectures

role-based proposers + reviewer(s) + synthesizer pipeline

iterative closed-loop synthesis modeled as a potential game

Collaboration

evidence-based cross-evaluation (reviews)

role diversity to induce complementary behavior

Optimization Features

Token Efficiency

Accuracy

parallel propose/review phases to reduce wall-clock

Infra Optimization

use of parallel calls to lower wall-clock; token cost used as proxy

System Optimization

fully parallel proposer and reviewer calls

closed-loop re-execution to avoid brute-force stacking

Inference Optimization

test-time compute scaling via multi-agent orchestration

temperature=0 deterministic synthesis for logical rigor

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

GSM8K, AIME-2025, MBPP, BFCL-SP

Risks & Boundaries

Limitations

Theoretical guarantees rely on deterministic execution or reviewer independence; these assumptions weaken for open-ended tasks without verifiers.

Experiments use one base model family, so gains may differ with heterogeneous ensembles.

When Not To Use

Open-ended dialogue or creative text generation where no execution or reliable verifier exists (G_info ≈ 0).

When reviewer/verifier accuracy is too low and adding reviewers is expensive compared to getting ground-truth signals.

Failure Modes

Voting amplifies correlated errors when correct answers fragment or p_i < 0.5.

Verifier or reviewer biases can misclassify evidence, harming synthesis.
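The p_i < 0.5 condition in the first failure mode can be illustrated with the standard majority-vote calculation. The sketch assumes a binary answer, independent voters, and a shared per-voter accuracy p, which is simpler than the correlated setting the paper discusses; `majority_accuracy` is a hypothetical helper.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a binary answer, identical per-voter accuracy p, and odd n.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Voting helps when voters are better than chance and hurts when they are worse.
above_chance = majority_accuracy(0.6, 5)   # higher than 0.6
below_chance = majority_accuracy(0.4, 5)   # lower than 0.4
```

With voters below 50% accuracy, adding more of them pushes the majority further toward the wrong answer, which is why evidence-based synthesis is preferred over plain voting in that regime.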

Core Entities

Models

Qwen3-30B-A3B-Instruct, Qwen3-235B-A22B-Instruct, DeepSeek-V3.2

Metrics

Accuracy, Pass@1, Token Cost

Datasets

GSM8K, AIME-2025, MBPP, BFCL-SP

Benchmarks

GSM8K, AIME-2025, MBPP, BFCL-SP

Context Entities

Models

Qwen3-30B-A3B-Instruct-2507 (used as base in experiments)

Metrics

Accuracy
