Overview
ShapleyFlow is a practical, interpretable method to guide component upgrades; it costs more compute but gives clear ROI signals for which module to improve.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.
Who Should Care
Summary TLDR
ShapleyFlow applies Shapley values from cooperative game theory to attribute outcomes to, and optimize, the components of agentic workflows (Planning, Reasoning, Action, Reflection). The authors build CapaBench (1,535 tasks across 7 domains) and exhaustively evaluate all 16 component configurations per workflow using 9 LLMs. ShapleyFlow finds task-specific optimal component mixes (often beating single-LLM agents), surfaces when Action or Reasoning matters most, and provides a reproducible method for workflow design. The cost is 2^n evaluations for n components, or reliance on an approximation.
Problem Statement
Current evaluations of agentic systems treat the workflow as a black box and miss how individual components and their interactions drive outcomes. We need a principled, quantitative way to attribute component contributions and recommend which components to upgrade for each task.
Main Contribution
ShapleyFlow: a game-theoretic framework that computes Shapley values over workflow components to attribute marginal and interaction effects.
CapaBench: a 1,535-task benchmark spanning shopping, navigation, ticketing, math, theorem proving, OS, and robot cooperation for component-level analysis.
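A minimal sketch of how Shapley values over the four workflow components could be computed exhaustively. It uses the standard Shapley formula rather than the authors' code. The `toy_score` function and all of its numbers are hypothetical stand-ins for a real benchmark score, where the empty set means every component runs on the baseline model.

```python
from functools import lru_cache
from itertools import combinations
from math import factorial

COMPONENTS = ("planning", "reasoning", "action", "reflection")

@lru_cache(maxsize=None)
def toy_score(upgraded):
    """Hypothetical task score when the components in `upgraded` use the test model."""
    score = 0.40                                   # assumed all-baseline score
    if "action" in upgraded:
        score += 0.20                              # assumed standalone gain
    if "reasoning" in upgraded:
        score += 0.10                              # assumed standalone gain
    if {"planning", "reasoning"} <= upgraded:
        score += 0.06                              # assumed interaction gain
    return score

def exact_shapley(components, value):
    """Standard Shapley values; `value` maps a frozenset of components to a score."""
    n = len(components)
    phi = {}
    for c in components:
        others = [x for x in components if x != c]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain c.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {c}) - value(s))
        phi[c] = total
    return phi

phi = exact_shapley(COMPONENTS, toy_score)
for name, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {contribution:.3f}")
```

Because `toy_score` is cached, each of the 16 configurations is scored only once. In this toy setup the interaction gain is split evenly between planning and reasoning, which a one-at-a-time ablation would not show.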
Key Findings
ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.
Action upgrades drive the largest gains on computation- and precision-heavy tasks such as math and automated theorem proving (ATP).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total tasks evaluated | 1,535 tasks | — | — | CapaBench (7 domains) | CapaBench construction; Table 1 | Table 1 |
| Config search size | 16 configurations per workflow | n+1 ablation: 5 | 16 vs 5 evaluations | 4-component architecture (P,R,A,F) | ShapleyFlow algorithm; Comparative Analysis | Algorithm 1; Comparative Analysis |
What To Try In 7 Days
Run ShapleyFlow on a small set (10–50) of representative tasks to find the highest-Shapley component.
If exhaustively evaluating all 2^n configurations is too costly, use KernelSHAP or a sampling approximation to get near-term guidance.
Prioritize Action upgrades for precision tasks (math, code, formal proof) and Reasoning/Planning for interactive or control tasks (OS, robots).
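One way to approximate the attribution when exhaustive evaluation is too expensive is Monte Carlo sampling over component orderings. The sketch below uses plain permutation sampling, not KernelSHAP. The `toy_score` function, its numbers, and the `num_permutations` setting are hypothetical and stand in for real runs on a small task set.

```python
import random

COMPONENTS = ("planning", "reasoning", "action", "reflection")

def toy_score(upgraded):
    """Hypothetical score when the components in `upgraded` use the test model."""
    score = 0.40                                   # assumed all-baseline score
    if "action" in upgraded:
        score += 0.20
    if "reasoning" in upgraded:
        score += 0.10
    if {"planning", "reasoning"} <= upgraded:
        score += 0.06                              # assumed interaction gain
    return score

def sampled_shapley(components, value, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in components}
    for _ in range(num_permutations):
        order = list(components)
        rng.shuffle(order)
        upgraded = frozenset()
        previous = value(upgraded)
        for c in order:
            # Upgrade one more component and credit it with the score change.
            upgraded = upgraded | {c}
            current = value(upgraded)
            totals[c] += current - previous
            previous = current
    return {c: t / num_permutations for c, t in totals.items()}

estimate = sampled_shapley(COMPONENTS, toy_score)
best = max(estimate, key=estimate.get)
print("highest estimated contribution:", best)
```

Each sampled ordering costs n + 1 evaluations if scores are not cached, so the number of orderings is the knob that trades accuracy for compute.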
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Exhaustive Shapley computation scales as 2^n and becomes expensive for many components.
Task success rate may under-report benefits of reflection; reflection impact depends on metric choice.
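To make the scaling in the first limitation concrete, the sketch below counts evaluations for exhaustive Shapley computation (2^n configurations) against one-at-a-time ablation (n + 1 configurations). The function names and the task count of 50 are illustrative assumptions.

```python
def exhaustive_configs(num_components):
    """Configurations needed for exact Shapley values: every subset of components."""
    return 2 ** num_components

def ablation_configs(num_components):
    """Configurations for one-at-a-time ablation: the baseline plus one per component."""
    return num_components + 1

def total_runs(num_configs, num_tasks):
    """Each configuration is run once on every task."""
    return num_configs * num_tasks

NUM_TASKS = 50  # assumed size of a small pilot task set
for n in (4, 6, 8, 10):
    full = exhaustive_configs(n)
    print(f"n={n:>2}: exhaustive={full:>5} configs "
          f"({total_runs(full, NUM_TASKS)} runs), ablation={ablation_configs(n)} configs")
```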
When Not To Use
When you cannot afford 2^n evaluations and no approximation is acceptable.
When you only need coarse, one-off comparisons rather than interaction-aware attribution.
Failure Modes
Reflection may appear useless under success-rate metrics even if it improves debugging.
Shapley recommendations depend on the chosen baseline model; a poorly chosen baseline can distort the absolute value estimates.
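The baseline dependence follows from a general property of Shapley values: they sum to the score with every component upgraded minus the score with every component on the baseline. The sketch below illustrates this with a purely additive toy model. All quality numbers and both baselines are hypothetical.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ("planning", "reasoning", "action", "reflection")

# Hypothetical per-component quality of the test model and of two candidate baselines.
TEST = {"planning": 0.20, "reasoning": 0.25, "action": 0.30, "reflection": 0.05}
WEAK_BASELINE = {"planning": 0.10, "reasoning": 0.10, "action": 0.10, "reflection": 0.05}
STRONG_BASELINE = {"planning": 0.18, "reasoning": 0.10, "action": 0.28, "reflection": 0.05}

def make_value(baseline):
    """Score function: upgraded components use TEST, the rest use `baseline`."""
    def value(upgraded):
        return sum(TEST[c] if c in upgraded else baseline[c] for c in COMPONENTS)
    return value

def exact_shapley(components, value):
    """Standard Shapley values over all subsets of `components`."""
    n = len(components)
    phi = {}
    for c in components:
        others = [x for x in components if x != c]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {c}) - value(s))
        phi[c] = total
    return phi

phi_weak = exact_shapley(COMPONENTS, make_value(WEAK_BASELINE))
phi_strong = exact_shapley(COMPONENTS, make_value(STRONG_BASELINE))
print("top component vs weak baseline:  ", max(phi_weak, key=phi_weak.get))
print("top component vs strong baseline:", max(phi_strong, key=phi_strong.get))
```

In this toy setup the same test model yields different top-ranked components under the two baselines, so it is worth fixing and reporting the baseline before comparing Shapley values across experiments.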

