Use Shapley values to explain and pick the best component mix for AI agent workflows

February 1, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

ShapleyFlow is a practical, interpretable method to guide component upgrades; it costs more compute but gives clear ROI signals for which module to improve.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 65%

Authors

Yingxuan Yang, Bo Huang, Siyuan Qi, Chao Feng, Haoyi Hu, Yuxuan Zhu, Jinbo Hu, Haoran Zhao, Ziyi He, Xiao Liu, Muning Wen, Zongyu Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Yong Yu, Weinan Zhang

Why It Matters For Business

ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.

Summary TLDR

ShapleyFlow applies Shapley values from cooperative game theory to attribute performance to, and optimize, the components of agentic workflows (Planning, Reasoning, Action, Reflection). The authors build CapaBench (1,535 tasks across 7 domains) and exhaustively evaluate all 16 component configurations per workflow using 9 LLMs. ShapleyFlow finds task-specific optimal component mixes (often beating single-LLM agents), surfaces when Action or Reasoning matters most, and provides a reproducible method for workflow design. The cost is 2^n evaluations for n components, or an approximation when that is too expensive.

Problem Statement

Current evaluations of agentic systems treat the workflow as a black box and miss how individual components and their interactions drive outcomes. We need a principled, quantitative way to attribute component contributions and recommend which components to upgrade for each task.

Main Contribution

ShapleyFlow: a game-theoretic framework that computes Shapley values over workflow components to attribute marginal and interaction effects.

CapaBench: a 1,535-task benchmark spanning shopping, navigation, ticketing, math, theorem proving, OS, and robot cooperation for component-level analysis.
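
For reference, the attribution quantity named above can be written with the standard Shapley formula. The notation is an assumption, not taken from the paper: N is the component set {P, R, A, F}, n = |N| = 4, and v(S) is the workflow score when only the components in S are upgraded.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr),
\qquad N = \{P, R, A, F\},\; n = 4
```

Each term is the marginal gain from adding component i to a coalition S, weighted by how often that coalition precedes i across all orderings of the components.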

Key Findings

ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.

Numbers: E-commerce optimal accuracy 43.31%; ATP (theorem proving) optimal 86.79%

Practical Use: Run Shapley-based attribution to pick which components to upgrade instead of swapping the whole agent; you can gain large accuracy jumps on some tasks.

Evidence Ref: Table 3; 'Optimal Workflow Discovery' paragraph

Action upgrades drive the largest gains on computation/precision tasks like Math and ATP.

Numbers: Math Acc best 83.80% with Action Shapley up to 0.483; ATP Acc best 86.79% with Action Shapley up to 0.660

Practical Use: Prioritize stronger action/execution models (tool-use, code generation) for math and formal-verification workloads.

Evidence Ref: Table 3; Table 4; Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Total tasks evaluated | 1,535 tasks | - | - | CapaBench (7 domains) | CapaBench construction; Table 1 | Table 1
Config search size | 16 configurations per workflow | n+1 ablation: 5 | 16 vs 5 evaluations | 4-component architecture (P,R,A,F) | ShapleyFlow algorithm; Comparative Analysis | Algorithm 1; Comparative Analysis
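
The 16-versus-5 comparison can be made concrete with a minimal sketch of exact Shapley attribution over the four components. The scores returned by `toy_score` are hypothetical illustration values, not CapaBench results, and all function names are assumptions rather than the authors' code.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ("P", "R", "A", "F")  # Planning, Reasoning, Action, Reflection


def all_coalitions(players):
    """Yield every subset of players as a frozenset (2^n subsets in total)."""
    for size in range(len(players) + 1):
        for subset in combinations(players, size):
            yield frozenset(subset)


def exact_shapley(players, value):
    """Exact Shapley values from a table mapping each coalition to its score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


def toy_score(coalition):
    """HYPOTHETICAL workflow score when the components in `coalition` are upgraded."""
    base = 0.20  # all-baseline workflow (made-up number)
    gains = {"P": 0.05, "R": 0.10, "A": 0.30, "F": 0.01}  # made-up numbers
    score = base + sum(gains[c] for c in coalition)
    if {"R", "A"} <= coalition:
        score += 0.10  # made-up interaction bonus between Reasoning and Action
    return score


# One evaluation per coalition: 2^4 = 16 entries.
value = {s: toy_score(s) for s in all_coalitions(COMPONENTS)}
phi = exact_shapley(COMPONENTS, value)
print(len(value), "evaluations")
print({c: round(v, 3) for c, v in phi.items()})
```

In this toy setup the Reasoning-Action interaction bonus is split evenly between the two components, which is the kind of effect a one-at-a-time (n+1) ablation cannot separate.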

What To Try In 7 Days

Run ShapleyFlow on a small set (10–50) of representative tasks to find the highest-Shapley component.

If exhaustive 2^n evaluation is too costly, use KernelSHAP or another sampling approximation to get approximate guidance quickly.

Prioritize Action upgrades for precision tasks (math, code, formal proof) and Reasoning/Planning for interactive or control tasks (OS, robots).
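
As a rough sketch of the sampling route mentioned above, the following uses plain Monte Carlo permutation sampling. This is a simpler stand-in, not KernelSHAP or SVARM, and it is not the authors' implementation. `toy_score` returns hypothetical values in place of real task evaluations, and `sampled_shapley` is an illustrative name.

```python
import random

COMPONENTS = ("P", "R", "A", "F")  # Planning, Reasoning, Action, Reflection


def toy_score(coalition):
    """HYPOTHETICAL score for a workflow with `coalition` upgraded (made-up numbers)."""
    gains = {"P": 0.05, "R": 0.10, "A": 0.30, "F": 0.01}
    score = 0.20 + sum(gains[c] for c in coalition)
    if {"R", "A"} <= coalition:
        score += 0.10  # made-up interaction bonus
    return score


def sampled_shapley(players, score, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    cache = {}  # avoid re-running the same configuration twice

    def cached_score(coalition):
        if coalition not in cache:
            cache[coalition] = score(coalition)
        return cache[coalition]

    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = cached_score(coalition)
        for p in order:
            # Add components one by one and credit each with its marginal gain.
            coalition = coalition | {p}
            current = cached_score(coalition)
            totals[p] += current - previous
            previous = current
    return {p: t / num_permutations for p, t in totals.items()}


estimate = sampled_shapley(COMPONENTS, toy_score)
print({c: round(v, 3) for c, v in estimate.items()})
```

With only four components the cache fills after at most 16 distinct configurations, so sampling saves little here. The approach pays off when n is large enough that most of the 2^n configurations are never visited.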

Agent Features

Memory
short-term multi-turn interaction between Reasoning and Action
Planning
single-turn planning for task decomposition; planning informs multi-turn reasoning/action
Tool Use
tool use emphasized in Math solver and OS tasks
Frameworks
ReAct-style multi-turn interaction; ShapleyFlow attribution framework
Is Agentic

Yes

Architectures
Planning-Reasoning-Action-Reflection (P,R,A,F)
Collaboration
component orchestration modeled as cooperative game

Optimization Features

Infra Optimization
use lightweight baseline model (Llama3-8B-Instruct) for large sweeps
System Optimization
replace only high-Shapley components rather than entire agent
Inference Optimization
use sampling-based Shapley approximations (KernelSHAP, SVARM) to reduce 2^n cost

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Exhaustive Shapley computation scales as 2^n and becomes expensive for many components.

Task success rate may under-report benefits of reflection; reflection impact depends on metric choice.
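
To give a feel for the 2^n scaling, here is a small back-of-the-envelope sketch. It assumes one workflow run per task per configuration, which is a simplification not stated in the source; the function names are illustrative.

```python
def exhaustive_runs(n_components, n_tasks):
    """Workflow runs needed to score every one of the 2^n configurations."""
    return (2 ** n_components) * n_tasks


def ablation_runs(n_components, n_tasks):
    """Workflow runs for a one-at-a-time (n + 1) ablation."""
    return (n_components + 1) * n_tasks


# Four components over all 1,535 CapaBench tasks, for one candidate model.
print(exhaustive_runs(4, 1535), "vs", ablation_runs(4, 1535))

# Growth as components are added, holding the task count fixed.
for n in (4, 6, 8, 10):
    print(n, exhaustive_runs(n, 1535))
```

Under that assumption the exhaustive sweep for four components is 16 × 1,535 = 24,560 runs per candidate model, against 7,675 for the n+1 ablation, and the count doubles with every added component.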

When Not To Use

When you cannot afford 2^n evaluations and no approximation is acceptable.

When you only need coarse, one-off comparisons rather than interaction-aware attribution.

Failure Modes

Reflection may appear useless under success-rate metrics even if it improves debugging.

Shapley recommendations depend on the chosen baseline model; wrong baseline can mislead absolute value estimates.
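
The baseline dependence follows from the efficiency property of Shapley values. The notation is an assumption for illustration: v(S) is the workflow score when the components in S use the candidate model and the rest use the baseline model, so v(∅) is the all-baseline score.

```latex
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset)
```

The attributions therefore split the gain over the all-baseline workflow. Choosing a different baseline changes v(∅) and every marginal term, so absolute Shapley values are comparable only across runs that share the same baseline.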

Core Entities

Models

Llama3-8B-Instruct, Llama3-70B-Instruct, Claude-3.5-Sonnet, gpt-4-turbo, gpt-4o-mini, qwen2.5-32B, Mistral-8X7B, Mistral-7B, doubao-pro-4k, GLM-4-air

Metrics

task success rate, Accuracy, reward

Datasets

CapaBench; Online Shopping; Navigation Planning; Ticket Ordering; Math (Algebra, Geometry); Automatic Theorem Proving (Coq, Lean4, Isabelle); Operating System (Ubuntu, Git); RobotCoop

Benchmarks

CapaBench

Context Entities

Models

GPT-4 (used to synthesize/expand datasets)