Use Shapley values to explain and pick the best component mix for AI agent workflows

February 1, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.6

Citation Count

1

Authors

Yingxuan Yang, Bo Huang, Siyuan Qi, Chao Feng, Haoyi Hu, Yuxuan Zhu, Jinbo Hu, Haoran Zhao, Ziyi He, Xiao Liu, Muning Wen, Zongyu Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Yong Yu, Weinan Zhang

Links

Abstract / PDF

Why It Matters For Business

ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.

Summary TLDR

ShapleyFlow applies Shapley values from cooperative game theory to attribute and optimize components in agentic workflows (Planning, Reasoning, Action, Reflection). The authors build CapaBench (1,535 tasks across 7 domains) and exhaustively evaluate 16 component configurations per workflow using 9 LLMs. ShapleyFlow finds task-specific optimal component mixes (often beating single-LLM agents), shows which tasks depend most on Action and which on Reasoning, and provides a reproducible method for workflow design. The trade-off is cost: exact attribution needs 2^n evaluations for n components, so larger workflows require approximations.
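For reference, the standard Shapley value from cooperative game theory can be written as follows. Here N is the set of n workflow components, and v(S) is assumed to be the task performance (accuracy or reward) of the configuration in which only the components in S are upgraded; this reading of v is an interpretation of the summary, not a quotation from the paper.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

With n = 4 components (Planning, Reasoning, Action, Reflection), each sum runs over 8 subsets, and all four attributions reuse the same 2^4 = 16 measured values v(S).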

Problem Statement

Current evaluations of agentic systems treat the workflow as a black box and miss how individual components and their interactions drive outcomes. We need a principled, quantitative way to attribute component contributions and recommend which components to upgrade for each task.

Main Contribution

ShapleyFlow: a game-theoretic framework that computes Shapley values over workflow components to attribute marginal and interaction effects.

CapaBench: a 1,535-task benchmark spanning shopping, navigation, ticketing, math, theorem proving, OS, and robot cooperation for component-level analysis.

Empirical guidance: exhaustive 16-configuration evaluations across 9 LLMs that identify task-specific optimal component mixes and regular patterns (e.g., Action-dominant vs Reasoning-dominant tasks).
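The attribution step can be sketched in a few lines of Python. This is a minimal illustration of exact Shapley computation over four components, not the authors' implementation. The function `toy_value` and its `GAINS` numbers are hypothetical placeholders standing in for measured task scores.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ("Planning", "Reasoning", "Action", "Reflection")

def exact_shapley(value, players=COMPONENTS):
    """Exact Shapley values; `value` maps a frozenset of upgraded components to a score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical per-component gains, for illustration only (not from the paper).
GAINS = {"Planning": 0.10, "Reasoning": 0.15, "Action": 0.30, "Reflection": 0.0}

def toy_value(coalition):
    """Hypothetical score: baseline 0.20, additive gains, plus one interaction term."""
    score = 0.20 + sum(GAINS[c] for c in coalition)
    if {"Reasoning", "Action"} <= coalition:
        score += 0.08  # assumed Reasoning-Action synergy
    return score

phi = exact_shapley(toy_value)
for name, val in phi.items():
    print(f"{name}: {val:.3f}")
```

In this toy game the interaction bonus is split equally between Reasoning and Action, and the four values sum to the gap between the fully upgraded and fully baseline configurations. That split is the interaction-aware behavior a one-at-a-time ablation cannot express.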

Key Findings

ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.

Numbers: E-commerce optimal accuracy 43.31%; ATP (theorem proving) optimal accuracy 86.79%

Action upgrades drive the largest gains on computation/precision tasks like Math and ATP.

Numbers: Math best accuracy 83.80% with Action Shapley up to 0.483; ATP best accuracy 86.79% with Action Shapley up to 0.660

Reasoning and Planning are most important for interactive/control tasks (OS, RobotCoop, Ticket).

Numbers: RobotCoop best reward 92.63% with Reasoning Shapley up to 0.388; Navigation best 74.42% with high Reasoning/Planning Shapley values

Shapley attributions correlate well with an independent LLM judge.

Numbers: Correlation with GPT-o1-mini: Planning 0.81, Reasoning 0.77, Action 0.67

Comprehensive coalition evaluation is more costly but yields richer insights than simple ablation.

Numbers: Exhaustive evaluation costs 2^n runs (16 for 4 components) vs n+1 runs (5) for ablation
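The two counts can be checked by enumerating the configurations directly. The sketch below assumes that each component runs on either a baseline model or a candidate model, and that ablation means the fully upgraded workflow plus one leave-one-out variant per component. Both are assumptions made for illustration, and the helper names are hypothetical.

```python
from itertools import product

COMPONENTS = ("Planning", "Reasoning", "Action", "Reflection")

def exhaustive_configs(baseline, candidate):
    """All 2^n assignments of baseline/candidate model to components."""
    return [
        dict(zip(COMPONENTS, choice))
        for choice in product((baseline, candidate), repeat=len(COMPONENTS))
    ]

def ablation_configs(baseline, candidate):
    """Fully upgraded workflow plus one leave-one-out variant per component (n+1)."""
    full = {c: candidate for c in COMPONENTS}
    configs = [full]
    for c in COMPONENTS:
        variant = dict(full)
        variant[c] = baseline  # downgrade exactly one component
        configs.append(variant)
    return configs

all_cfgs = exhaustive_configs("Llama3-8B-Instruct", "Claude-3.5-Sonnet")
abl_cfgs = ablation_configs("Llama3-8B-Instruct", "Claude-3.5-Sonnet")
print(len(all_cfgs), len(abl_cfgs))
```

Every ablation configuration also appears in the exhaustive set, so the exhaustive sweep costs 11 additional evaluations per candidate model and in return covers every pairwise and higher-order combination.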

Reflection component shows low Shapley values under task success metrics.

Numbers: Reflection Shapley values are often near zero or negative across datasets

Results

Total tasks evaluated

Value: 1,535 tasks

Config search size

Value: 16 configurations per workflow

Baseline: n+1 ablation (5 configurations)

Accuracy (Math)

Value: 83.80% (best config)

Baseline: Llama3-8B-Instruct 21.6% (Algebra)

Accuracy (ATP)

Value: 86.79% (best config)

Baseline: Llama3-8B 6.4% (Coq)

Attribution consistency vs LLM-judge

Value: Planning 0.81, Reasoning 0.77, Action 0.67

What To Try In 7 Days

Run ShapleyFlow on a small set (10–50) of representative tasks to find the highest-Shapley component.

If exhaustive 2^n is too costly, run a KernelSHAP or sampling approximation to get near-term guidance.

Prioritize Action upgrades for precision tasks (math, code, formal proof) and Reasoning/Planning for interactive or control tasks (OS, robots).
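When the exhaustive sweep is too expensive, a sampling estimator is one option. The sketch below uses plain Monte Carlo permutation sampling, which is a simpler technique than KernelSHAP or SVARM and is shown only to illustrate the idea. The `toy_value` function and its numbers are hypothetical placeholders for real task evaluations.

```python
import random

def sampled_shapley(value, players, num_permutations=200, seed=0):
    """Monte Carlo Shapley estimate: average marginal gain over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    cache = {}  # avoid re-evaluating a coalition that was already scored

    def cached(coalition):
        if coalition not in cache:
            cache[coalition] = value(coalition)
        return cache[coalition]

    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        prev = cached(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = cached(coalition)
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}

PLAYERS = ("Planning", "Reasoning", "Action", "Reflection")
# Hypothetical gains, for illustration only (not from the paper).
TOY_GAINS = {"Planning": 0.10, "Reasoning": 0.15, "Action": 0.30, "Reflection": 0.0}

def toy_value(coalition):
    """Hypothetical score with one assumed Reasoning-Action interaction."""
    bonus = 0.08 if {"Reasoning", "Action"} <= coalition else 0.0
    return 0.20 + sum(TOY_GAINS[c] for c in coalition) + bonus

estimate = sampled_shapley(toy_value, PLAYERS)
print({k: round(v, 3) for k, v in estimate.items()})
```

With only four components the cache fills after at most 16 evaluations, so sampling saves nothing here. The approach pays off when a workflow has many more components and only a fraction of the 2^n coalitions can be evaluated.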

Agent Features

Memory

  • short-term multi-turn interaction between Reasoning and Action

Planning

  • single-turn planning for task decomposition
  • planning informs multi-turn reasoning/action

Tool Use

  • tool use emphasized in Math solver and OS tasks

Frameworks

  • ReAct-style multi-turn interaction
  • ShapleyFlow attribution framework

Is Agentic

true

Architectures

  • Planning-Reasoning-Action-Reflection (P,R,A,F)

Collaboration

  • component orchestration modeled as cooperative game

Optimization Features

Infra Optimization

  • use lightweight baseline model (Llama3-8B-Instruct) for large sweeps

System Optimization

  • replace only high-Shapley components rather than entire agent

Inference Optimization

  • use sampling-based Shapley approximations (KernelSHAP, SVARM) to reduce 2^n cost

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Exhaustive Shapley computation scales as 2^n and becomes expensive for many components.
  • Task success rate may under-report benefits of reflection; reflection impact depends on metric choice.
  • Absolute Shapley values shift with baseline model strength, requiring sensitivity checks.

When Not To Use

  • When you cannot afford 2^n evaluations and no approximation is acceptable.
  • When you only need coarse, one-off comparisons rather than interaction-aware attribution.
  • When downstream metrics value qualities not captured by task success rate (e.g., interpretability).

Failure Modes

  • Reflection may appear useless under success-rate metrics even if it improves debugging.
  • Shapley recommendations depend on the chosen baseline model; a poorly chosen baseline can distort the absolute value estimates.
  • Approximate Shapley estimators can miss rare but important coalition effects if sampling is too sparse.

Core Entities

Models

  • Llama3-8B-Instruct
  • Llama3-70B-Instruct
  • Claude-3.5-Sonnet
  • gpt-4-turbo
  • gpt-4o-mini
  • qwen2.5-32B
  • Mistral-8X7B
  • Mistral-7B
  • doubao-pro-4k
  • GLM-4-air

Metrics

  • task success rate
  • Accuracy
  • reward

Datasets

  • CapaBench
  • Online Shopping
  • Navigation Planning
  • Ticket Ordering
  • Math (Algebra, Geometry)
  • Automatic Theorem Proving (Coq, Lean4, Isabelle)
  • Operating System (Ubuntu, Git)
  • RobotCoop

Benchmarks

  • CapaBench

Context Entities

Models

  • GPT-4 (used to synthesize/expand datasets)