Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.
Summary TLDR
ShapleyFlow applies Shapley values from cooperative game theory to attribute and optimize components in agentic workflows (Planning, Reasoning, Action, Reflection). The authors build CapaBench (1,535 tasks across 7 domains) and exhaustively evaluate 16 component configurations per workflow using 9 LLMs. ShapleyFlow finds task-specific optimal component mixes that often beat single-LLM agents, shows when Action or Reasoning matters most, and provides a reproducible method for workflow design. The trade-off is cost: exact attribution needs 2^n evaluations for n components (16 for the four used here), so larger workflows require approximations.
Problem Statement
Current evaluations of agentic systems treat the workflow as a black box and miss how individual components and their interactions drive outcomes. We need a principled, quantitative way to attribute component contributions and recommend which components to upgrade for each task.
Main Contribution
ShapleyFlow: a game-theoretic framework that computes Shapley values over workflow components to attribute marginal and interaction effects.
CapaBench: a 1,535-task benchmark spanning shopping, navigation, ticketing, math, theorem proving, OS, and robot cooperation for component-level analysis.
Empirical guidance: exhaustive 16-configuration evaluations across 9 LLMs that identify task-specific optimal component mixes and regular patterns (e.g., Action-dominant vs Reasoning-dominant tasks).
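For reference, the standard Shapley value from cooperative game theory can be written as follows. This is the textbook definition, not a formula quoted from the paper. Reading v(S) as the workflow score when the components in S are run by the model under test and the remaining components by the baseline model is an assumption made here for illustration.

```latex
% N: set of workflow components; v(S): measured score of coalition S (assumed reading)
\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl[\,v(S \cup \{i\}) - v(S)\,\bigr],
\qquad N = \{P, R, A, F\},\quad n = 4 .
\]
% Efficiency property: attributions sum to the total gain over the empty coalition.
\[
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset)
\]
```

With n = 4 components there are 2^4 = 16 coalitions, which matches the 16 configurations evaluated per workflow.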
Key Findings
ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.
Action upgrades drive the largest gains on computation/precision tasks like Math and ATP.
Reasoning and Planning are most important for interactive/control tasks (OS, RobotCoop, Ticket).
Shapley attributions correlate well with an independent LLM judge.
Comprehensive coalition evaluation is more costly but yields richer insights than simple ablation.
Reflection component shows low Shapley values under task success metrics.
Results
Total tasks evaluated
Config search size
Accuracy
Accuracy
Attribution consistency vs LLM-judge
Who Should Care
What To Try In 7 Days
Run ShapleyFlow on a small set (10–50) of representative tasks to find the highest-Shapley component.
If exhaustive 2^n evaluation is too costly, run KernelSHAP or another sampling approximation to get usable guidance sooner.
Prioritize Action upgrades for precision tasks (math, code, formal proof) and Reasoning/Planning for interactive or control tasks (OS, robots).
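The first step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `exact_shapley` and `toy_value` are hypothetical names, and the scores inside `toy_value` are made-up numbers standing in for measured task success on your own task sample.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ("planning", "reasoning", "action", "reflection")

def exact_shapley(value, players=COMPONENTS):
    """Exact Shapley values over all 2^n coalitions.

    `value` maps a frozenset of upgraded components to a score.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight for coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(upgraded):
    """Hypothetical scores: replace with measured success on real tasks."""
    gains = {"planning": 0.05, "reasoning": 0.10, "action": 0.20, "reflection": 0.01}
    score = 0.30 + sum(gains[c] for c in upgraded)
    if {"reasoning", "action"} <= upgraded:
        score += 0.04  # made-up interaction bonus between two components
    return score

phi = exact_shapley(toy_value)
best = max(phi, key=phi.get)  # component with the highest Shapley value
print(best, phi)
```

In this toy game the interaction bonus is split evenly between Reasoning and Action, which is the kind of effect a one-at-a-time ablation would not attribute cleanly.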
Agent Features
Memory
- short-term multi-turn interaction between Reasoning and Action
Planning
- single-turn planning for task decomposition
- planning informs multi-turn reasoning/action
Tool Use
- tool use emphasized in Math solver and OS tasks
Frameworks
- ReAct-style multi-turn interaction
- ShapleyFlow attribution framework
Is Agentic
true
Architectures
- Planning-Reasoning-Action-Reflection (P,R,A,F)
Collaboration
- component orchestration modeled as cooperative game
Optimization Features
Infra Optimization
- use lightweight baseline model (Llama3-8B-Instruct) for large sweeps
System Optimization
- replace only high-Shapley components rather than entire agent
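One way to turn attributions into a mixed configuration is a simple threshold rule, sketched below. The function `plan_upgrades`, the `min_gain` threshold, the placeholder name `"strong-model"`, and the Shapley values are all hypothetical; the source does not prescribe a specific selection rule.

```python
def plan_upgrades(shapley, strong_model, baseline_model, min_gain=0.05):
    """Assign the strong model only to components whose Shapley value
    meets the (assumed) minimum-gain threshold."""
    return {
        component: (strong_model if value >= min_gain else baseline_model)
        for component, value in shapley.items()
    }

# Made-up attributions for illustration only.
shapley = {"planning": 0.03, "reasoning": 0.12, "action": 0.22, "reflection": 0.01}
config = plan_upgrades(shapley, "strong-model", "Llama3-8B-Instruct")
print(config)
```

The threshold should reflect what an upgrade costs per component; a stricter threshold keeps more of the workflow on the cheaper baseline.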
Inference Optimization
- use sampling-based Shapley approximations (KernelSHAP, SVARM) to reduce 2^n cost
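The idea behind sampling-based estimators can be illustrated with plain Monte Carlo permutation sampling, shown below. This is neither KernelSHAP nor SVARM, just a simpler estimator from the same family; `sampled_shapley` and `toy_value` are hypothetical names and the scores are made up.

```python
import random

COMPONENTS = ("planning", "reasoning", "action", "reflection")

def sampled_shapley(value, players=COMPONENTS, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings.

    Returns (estimates, number of distinct coalitions evaluated).
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    cache = {}  # each coalition is evaluated at most once

    def cached(coalition):
        if coalition not in cache:
            cache[coalition] = value(coalition)
        return cache[coalition]

    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        prev = cached(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = cached(coalition)
            totals[p] += cur - prev  # marginal gain of adding p in this ordering
            prev = cur
    estimates = {p: t / num_permutations for p, t in totals.items()}
    return estimates, len(cache)

def toy_value(upgraded):
    """Hypothetical scores: replace with measured success on real tasks."""
    gains = {"planning": 0.05, "reasoning": 0.10, "action": 0.20, "reflection": 0.01}
    score = 0.30 + sum(gains[c] for c in upgraded)
    if {"reasoning", "action"} <= upgraded:
        score += 0.04  # made-up interaction bonus
    return score

estimates, n_evals = sampled_shapley(toy_value, num_permutations=50)
print(estimates, n_evals)
```

With only four components the cache can never exceed 16 coalitions, so sampling saves nothing here; the approach pays off when n is large enough that 2^n evaluations are out of reach and the number of sampled orderings bounds the cost instead.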
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Exhaustive Shapley computation scales as 2^n and becomes expensive for many components.
- Task success rate may under-report benefits of reflection; reflection impact depends on metric choice.
- Absolute Shapley values shift with baseline model strength, requiring sensitivity checks.
When Not To Use
- When you cannot afford 2^n evaluations and no approximation is acceptable.
- When you only need coarse, one-off comparisons rather than interaction-aware attribution.
- When downstream metrics value qualities not captured by task success rate (e.g., interpretability).
Failure Modes
- Reflection may appear useless under success-rate metrics even if it improves debugging.
- Shapley recommendations depend on the chosen baseline model; a poorly chosen baseline can distort the absolute value estimates.
- Approximate Shapley estimators can miss rare but important coalition effects if sampling is too sparse.
Core Entities
Models
- Llama3-8B-Instruct
- Llama3-70B-Instruct
- Claude-3.5-Sonnet
- gpt-4-turbo
- gpt-4o-mini
- qwen2.5-32B
- Mixtral-8x7B
- Mistral-7B
- doubao-pro-4k
- GLM-4-air
Metrics
- task success rate
- Accuracy
- reward
Datasets
- CapaBench
- Online Shopping
- Navigation Planning
- Ticket Ordering
- Math (Algebra, Geometry)
- Automatic Theorem Proving (Coq, Lean4, Isabelle)
- Operating System (Ubuntu, Git)
- RobotCoop
Benchmarks
- CapaBench
Context Entities
Models
- GPT-4 (used to synthesize/expand datasets)

