Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
PRISM shows you can get higher accuracy and much better token-cost efficiency by coordinating small models with execution grounding and evidence-based synthesis, rather than only buying bigger models.
Summary TLDR
The paper introduces PRISM, a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) guided by a formal decomposition of multi-agent gains into Exploration (diverse candidates), Information (high-fidelity feedback), and Aggregation (evidence-based synthesis). PRISM uses role-based proposers, execution or pseudo-verification feedback, and iterative closed-loop synthesis. Across GSM8K, AIME-2025, MBPP, and BFCL-SP it outperforms existing multi-agent baselines and shows better compute efficiency on token-cost versus accuracy Pareto fronts. The paper provides proofs that execution feedback is information-theoretically optimal for deterministic tasks and convergence guarantees for its synthesis loop.
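The four phases can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function names (`propose`, `execute`, `review`, `synthesize`, `prism`), the canned candidates that stand in for LLM calls, and the toy task of finding an expression that doubles `x`. It is not the paper's implementation.

```python
from typing import Dict, List, Optional, Tuple

Tests = List[Tuple[int, int]]  # (input x, expected output) pairs

def propose(roles: List[str]) -> List[str]:
    # Hypothetical stand-in for role-based proposers: each role returns
    # a candidate expression in x instead of calling an LLM.
    canned = {"Minimalist": "x * 2", "Skeptic": "x + x", "Explorer": "x ** 2"}
    return [canned[role] for role in roles]

def execute(candidate: str, tests: Tests) -> List[bool]:
    # Execution grounding: evaluate the candidate on every test input.
    # eval with empty builtins is NOT a real sandbox; it only keeps the toy small.
    results = []
    for x, expected in tests:
        try:
            results.append(eval(candidate, {"__builtins__": {}}, {"x": x}) == expected)
        except Exception:
            results.append(False)
    return results

def review(evidence: Dict[str, List[bool]]) -> Dict[str, float]:
    # Stub reviewer: score each candidate by its fraction of passed tests.
    return {cand: sum(res) / len(res) for cand, res in evidence.items()}

def synthesize(scores: Dict[str, float]) -> str:
    # Stub synthesizer: keep the best-supported candidate.
    return max(scores, key=scores.get)

def prism(roles: List[str], tests: Tests, max_iters: int = 3) -> Optional[str]:
    best = None
    for _ in range(max_iters):
        candidates = propose(roles)                              # Propose
        evidence = {c: execute(c, tests) for c in candidates}    # Execute
        scores = review(evidence)                                # Review
        best = synthesize(scores)                                # Synthesize
        if all(execute(best, tests)):                            # closed-loop validation
            break
    return best

winner = prism(["Minimalist", "Skeptic", "Explorer"], [(1, 2), (3, 6)])
```

In a real system the stubbed reviewer and synthesizer would be model calls that read the execution evidence; the loop structure is the only part this sketch tries to convey.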
Problem Statement
Multi-agent LLM methods help reasoning but are heuristic and fragmented. Practitioners lack a principled view of why teams of LLMs help, which parts to optimize, and how to allocate compute. This paper provides a unified decomposition of multi-agent gains and a practical system (PRISM) that jointly optimizes the three identified dimensions to get sustained accuracy scaling and better compute-efficiency.
Main Contribution
A formal three-way decomposition of multi-agent gains: Exploration, Information, Aggregation.
PRISM: a four-phase multi-agent workflow (Propose, Execute, Review, Synthesize) that jointly optimizes all three dimensions.
Theoretical results: execution feedback is information-theoretically optimal for deterministic tasks and the synthesis loop has convergence guarantees.
Extensive empirical evaluation showing improved accuracy and token-efficiency across math, code, and function-calling benchmarks.
Key Findings
PRISM outperforms all tested multi-agent baselines on four benchmarks.
Execution feedback yields maximal selection information for deterministic tasks.
Role-based diversity reduces pairwise success correlation.
PRISM achieves strong token-efficiency on MBPP.
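Gains from jointly optimizing the three dimensions are subadditive (the combined gain is smaller than the sum of the individual gains), yet joint optimization still outperforms single-dimension methods.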
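The diversity finding above refers to pairwise success correlation. One way to measure it on your own runs, sketched under the assumption that each agent's results are recorded as a 0/1 vector over the same problems, is a mean Pearson correlation across agent pairs. The helper names and the zero-variance convention are illustrative choices, not taken from the paper.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List

def pearson(a: List[int], b: List[int]) -> float:
    # Pearson correlation between two equal-length 0/1 success vectors.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for a constant vector; treated as 0 here (assumption)
    return cov / sqrt(var_a * var_b)

def mean_pairwise_correlation(success: Dict[str, List[int]]) -> float:
    # Average correlation over all unordered pairs of agents.
    pairs = list(combinations(success.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

Lower values mean the agents tend to fail on different problems, which is the condition under which adding proposers contributes new correct candidates.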
Results
Accuracy
Token-efficiency (MBPP)
Who Should Care
What To Try In 7 Days
Run a small PRISM pilot: 3 role-based proposers (Minimalist, Skeptic, Explorer), 1 reviewer, 3 synthesis iterations on an executable task.
If you have test suites or validators, add execution grounding to selection instead of relying on self-critique.
Measure token cost vs accuracy and plot a simple Pareto curve to see if joint optimization beats scaling a single model.
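For the last step, a Pareto front over (token cost, accuracy) measurements can be computed in a few lines. This is a sketch: the `Point` layout and the function name are assumptions, and you would feed it your own measured runs.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, token_cost, accuracy)

def pareto_front(points: List[Point]) -> List[Point]:
    # Keep configurations that no cheaper-or-equal configuration matches or beats
    # on accuracy. Sort by cost ascending, breaking ties by accuracy descending.
    front: List[Point] = []
    best_acc = float("-inf")
    for label, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:
            front.append((label, cost, acc))
            best_acc = acc
    return front
```

If the multi-agent configurations sit on the front while single-model runs at the same token budget fall below it, that is the signal the pilot is looking for.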
Agent Features
Memory
- closed-loop short-term validation (re-execute & refine)
Planning
- iterative synthesis with closed-loop validation
Tool Use
- sandboxed code execution
- function calling/schema validation
- LLM-based pseudo-verifier when execution not available
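The fallback order listed above (execute when possible, otherwise use a pseudo-verifier) might look like the following sketch for code candidates. The names `pass_count` and `select` and the `pseudo_verifier` callable are hypothetical, and plain `exec` is used only to keep the example short; it is not a sandbox and should not run untrusted model output.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)

def pass_count(source: str, func_name: str, tests: List[TestCase]) -> int:
    # Load the candidate and count how many tests it passes.
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOT sandboxed; illustration only
        func = namespace[func_name]
    except Exception:
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed

def select(
    candidates: List[str],
    func_name: str,
    tests: List[TestCase],
    pseudo_verifier: Optional[Callable[[str], float]] = None,
) -> str:
    if tests:
        # Execution grounding: prefer the candidate that passes the most tests.
        return max(candidates, key=lambda src: pass_count(src, func_name, tests))
    if pseudo_verifier is not None:
        # No executable signal: fall back to a scoring model (stubbed as a callable).
        return max(candidates, key=pseudo_verifier)
    return candidates[0]
```

Usage: pass the candidate sources, the name of the function under test, and whatever test cases exist; supply `pseudo_verifier` only when the test list is empty.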
Frameworks
- PRISM
Is Agentic
true
Architectures
- role-based proposers + reviewer(s) + synthesizer pipeline
- iterative closed-loop synthesis modeled as a potential game
Collaboration
- evidence-based cross-evaluation (reviews)
- role diversity to induce complementary behavior
Optimization Features
Token Efficiency
- Accuracy
- parallel propose/review phases to reduce wall-clock time
Infra Optimization
- parallel calls lower wall-clock time; token cost serves as a proxy for compute
System Optimization
- fully parallel proposer and reviewer calls
- closed-loop re-execution to avoid brute-force stacking
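Issuing proposer (or reviewer) calls concurrently is straightforward with `asyncio.gather` from the Python standard library. In this sketch `call_agent` is a hypothetical stand-in for an asynchronous model request; a short sleep replaces the network round trip.

```python
import asyncio
from typing import List

async def call_agent(role: str, task: str) -> str:
    # Hypothetical stand-in for an async LLM request.
    await asyncio.sleep(0.01)
    return f"{role}: draft for {task}"

async def run_parallel(roles: List[str], task: str) -> List[str]:
    # gather runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*(call_agent(role, task) for role in roles))

drafts = asyncio.run(run_parallel(["Minimalist", "Skeptic", "Explorer"], "task-1"))
```

Concurrency reduces wall-clock time only; the total number of tokens consumed is unchanged.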
Inference Optimization
- test-time compute scaling via multi-agent orchestration
- temperature=0 deterministic synthesis for logical rigor
Reproducibility
Data Urls
- GSM8K
- AIME-2025
- MBPP
- BFCL-SP
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Theoretical guarantees rely on deterministic execution or reviewer independence; these assumptions weaken for open-ended tasks without verifiers.
- Experiments use one base model family, so gains may differ with heterogeneous ensembles.
- AIME-2025 results are limited by small sample size (N=30) and wide confidence intervals.
- Token cost is used as a proxy for compute; real infra costs and latency effects need separate analysis.
When Not To Use
- Open-ended dialogue or creative text generation where no execution or reliable verifier exists (G_info ≈ 0).
- When reviewer/verifier accuracy is too low and adding reviewers is expensive compared to getting ground-truth signals.
Failure Modes
- Voting amplifies correlated errors when votes for the correct answer fragment or when per-agent accuracy p_i is below 0.5.
- Verifier or reviewer biases can misclassify evidence, harming synthesis.
- Synthesis agent itself can introduce new errors; closed-loop validation mitigates but may not eliminate this.
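The voting failure mode can be illustrated with the standard binomial calculation for majority voting. This sketch makes two simplifying assumptions that the source does not: voters are independent, and each answer is simply right or wrong with the same probability `p` for every voter. Correlated errors, which the source flags, make the picture worse than this idealized case.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    # Probability that a strict majority of n independent voters is correct,
    # when each voter is correct with probability p. n must be odd to avoid ties.
    if n % 2 == 0:
        raise ValueError("n must be odd")
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))
```

With three voters, `p = 0.6` gives a majority accuracy above 0.6, while `p = 0.4` gives one below 0.4: voting helps only when individual agents are right more often than not.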
Core Entities
Models
- Qwen3-30B-A3B-Instruct
- Qwen3-235B-A22B-Instruct
- DeepSeek-V3.2
Metrics
- Accuracy
- Pass@1
- Token Cost
Datasets
- GSM8K
- AIME-2025
- MBPP
- BFCL-SP
Benchmarks
- GSM8K
- AIME-2025
- MBPP
- BFCL-SP
Context Entities
Models
- Qwen3-30B-A3B-Instruct-2507 (used as the base model in experiments)
Metrics
- Accuracy

