Use Shapley values to explain and pick the best component mix for AI agent workflows

Overview

Decision Snapshot: Ready For Pilot

ShapleyFlow is a practical, interpretable method to guide component upgrades; it costs more compute but gives clear ROI signals for which module to improve.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 65%

Authors

Yingxuan Yang, Bo Huang, Siyuan Qi, Chao Feng, Haoyi Hu, Yuxuan Zhu, Jinbo Hu, Haoran Zhao, Ziyi He, Xiao Liu, Muning Wen, Zongyu Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Yong Yu, Weinan Zhang

Why It Matters For Business

ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

ShapleyFlow applies Shapley values from cooperative game theory to attribute and optimize components in agentic workflows (Planning, Reasoning, Action, Reflection). The authors build CapaBench (1,535 tasks across 7 domains) and exhaustively evaluate 16 component configurations per workflow using 9 LLMs. ShapleyFlow finds task-specific optimal component mixes (often beating single-LLM agents), surfaces when Action or Reasoning matter most, and provides a reproducible method for workflow design at the cost of 2^n evaluations or the need for approximations.

Problem Statement

Current evaluations of agentic systems treat the workflow as a black box and miss how individual components and their interactions drive outcomes. We need a principled, quantitative way to attribute component contributions and recommend which components to upgrade for each task.

Main Contribution

ShapleyFlow: a game-theoretic framework that computes Shapley values over workflow components to attribute marginal and interaction effects.

CapaBench: a 1,535-task benchmark spanning shopping, navigation, ticketing, math, theorem proving, OS, and robot cooperation for component-level analysis.
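For reference, the attribution above rests on the standard Shapley value from cooperative game theory. In the sketch below, N = {P, R, A, F} is the set of workflow components and v(S) is assumed to be the task performance when the components in S use the model under test while the remaining components use the baseline model. That reading of v is an assumption drawn from this summary, not a quotation from the paper.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr],
  \qquad n = |N| = 4

% Efficiency property: the attributions sum to the total gain over the baseline.
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\varnothing)
```

The efficiency property is what makes the values readable as a budget split: the total improvement of the fully upgraded workflow over the all-baseline workflow is divided among the four components.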

Key Findings

ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.

Numbers: E-commerce optimal accuracy 43.31%; ATP (theorem proving) optimal 86.79%

Practical Use: Run Shapley-based attribution to pick which components to upgrade instead of swapping the whole agent; you can gain large accuracy jumps on some tasks.

Evidence Ref: Table 3; 'Optimal Workflow Discovery' paragraph

Action upgrades drive the largest gains on computation/precision tasks like Math and ATP.

Numbers: Math Acc best 83.80% with Action Shapley up to 0.483; ATP Acc best 86.79% with Action Shapley up to 0.660

Practical Use: Prioritize stronger action/execution models (tool-use, code generation) for math and formal-verification workloads.

Evidence Ref: Table 3; Table 4; Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total tasks evaluated	1,535 tasks	—	—	CapaBench (7 domains)	CapaBench construction; Table 1	Table 1
Config search size	16 configurations per workflow	n+1 ablation: 5	16 vs 5 evaluations	4-component architecture (P,R,A,F)	ShapleyFlow algorithm; Comparative Analysis	Algorithm 1; Comparative Analysis
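The 16-configuration search in the table can be illustrated with a minimal Python sketch that enumerates every subset of the four components and computes exact Shapley values from the resulting scores. The scores come from a hypothetical `toy_value` function with made-up gains and one made-up Reasoning/Action synergy; none of these numbers are from the paper, and the function names are illustrative rather than the authors' implementation.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ("P", "R", "A", "F")  # Planning, Reasoning, Action, Reflection


def all_coalitions(players):
    """Yield every subset of `players` as a frozenset (2^n subsets in total)."""
    for size in range(len(players) + 1):
        for subset in combinations(players, size):
            yield frozenset(subset)


def exact_shapley(players, value):
    """Exact Shapley values; `value` maps a frozenset of upgraded components to a score."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for s in all_coalitions(others):
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Hypothetical per-component gains over an all-baseline score of 0.20 (NOT from the paper).
GAIN = {"P": 0.05, "R": 0.10, "A": 0.30, "F": 0.01}


def toy_value(coalition):
    """Made-up score for a configuration where `coalition` is upgraded."""
    score = 0.20 + sum(GAIN[c] for c in coalition)
    if {"R", "A"} <= coalition:
        score += 0.08  # hypothetical interaction between Reasoning and Action
    return score


configs = list(all_coalitions(COMPONENTS))  # the 16 configurations
phi = exact_shapley(COMPONENTS, toy_value)
best = max(phi, key=phi.get)                # component with the largest attribution
```

In this toy game the interaction bonus is split evenly between Reasoning and Action, which is the kind of effect an n+1 ablation (baseline plus one upgrade at a time) cannot observe because it never evaluates the two upgrades together.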

What To Try In 7 Days

Run ShapleyFlow on a small set (10–50) of representative tasks to find the highest-Shapley component.

If exhaustive 2^n is too costly, run a KernelSHAP or sampling approximation to get near-term guidance.

Prioritize Action upgrades for precision tasks (math, code, formal proof) and Reasoning/Planning for interactive or control tasks (OS, robots).
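For the approximation step above, one simple option is Monte Carlo permutation sampling, sketched below. This is a plain sampling estimator, not KernelSHAP or SVARM, and the `demo_value` function with its scores is hypothetical. In practice `value` would run the workflow on the sampled tasks and return the mean success rate, so caching repeated configurations would matter.

```python
import random


def sampled_shapley(players, value, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous  # marginal gain of adding p in this ordering
            previous = current
    return {p: totals[p] / num_permutations for p in players}


# Hypothetical scores for illustration only (NOT from the paper).
DEMO_GAIN = {"P": 0.05, "R": 0.10, "A": 0.30, "F": 0.01}


def demo_value(coalition):
    """Made-up score with one Reasoning/Action interaction term."""
    bonus = 0.08 if {"R", "A"} <= coalition else 0.0
    return 0.20 + sum(DEMO_GAIN[c] for c in coalition) + bonus


estimate = sampled_shapley(("P", "R", "A", "F"), demo_value, num_permutations=500)
```

With only four components, sampling saves nothing over the 16 exhaustive configurations; the estimator becomes useful once the number of components makes 2^n evaluations impractical.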

Agent Features

Memory

short-term multi-turn interaction between Reasoning and Action

Planning

single-turn planning for task decomposition

planning informs multi-turn reasoning/action

Tool Use

tool use emphasized in Math solver and OS tasks

Frameworks

ReAct-style multi-turn interaction

ShapleyFlow attribution framework

Is Agentic

Yes

Architectures

Planning-Reasoning-Action-Reflection (P,R,A,F)

Collaboration

component orchestration modeled as cooperative game

Optimization Features

Infra Optimization

use lightweight baseline model (Llama3-8B-Instruct) for large sweeps

System Optimization

replace only high-Shapley components rather than entire agent

Inference Optimization

use sampling-based Shapley approximations (KernelSHAP, SVARM) to reduce 2^n cost
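The "replace only high-Shapley components" idea can be sketched as a small selection step: rank components by attribution and assign the stronger model only to the top ones. The Shapley values, the `hybrid_assignment` helper, and the pairing of Llama3-8B-Instruct with Claude-3.5-Sonnet are all assumptions for illustration.

```python
def hybrid_assignment(shapley, baseline_model, candidate_model, top_k=1):
    """Give the candidate model to the top_k components by Shapley value, baseline to the rest."""
    ranked = sorted(shapley, key=shapley.get, reverse=True)
    upgraded = set(ranked[:top_k])
    return {
        component: (candidate_model if component in upgraded else baseline_model)
        for component in shapley
    }


# Hypothetical attributions for one task family (NOT from the paper).
shapley_values = {"Planning": 0.05, "Reasoning": 0.14, "Action": 0.34, "Reflection": 0.01}

assignment = hybrid_assignment(
    shapley_values,
    baseline_model="Llama3-8B-Instruct",
    candidate_model="Claude-3.5-Sonnet",
    top_k=1,
)
```

Raising `top_k` trades cost for coverage; since attributions are task-specific, the ranking would need to be recomputed per workflow rather than reused across domains.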

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Exhaustive Shapley computation scales as 2^n and becomes expensive for many components.

Task success rate may under-report benefits of reflection; reflection impact depends on metric choice.
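To make the 2^n scaling concrete, the sketch below counts workflow runs under the simplifying assumption that each configuration is run once on each task for each candidate model. The paper's actual evaluation budget may differ; the helper names are illustrative.

```python
def exhaustive_runs(num_components, num_tasks, num_models=1):
    """Runs needed to evaluate all 2^n component configurations."""
    return (2 ** num_components) * num_tasks * num_models


def ablation_runs(num_components, num_tasks, num_models=1):
    """Runs needed for an n+1 ablation (baseline plus one upgrade at a time)."""
    return (num_components + 1) * num_tasks * num_models


four_component = exhaustive_runs(4, 1535)   # 16 configurations x 1,535 tasks = 24,560
four_ablation = ablation_runs(4, 1535)      # 5 configurations x 1,535 tasks = 7,675
eight_component = exhaustive_runs(8, 1535)  # 256 configurations x 1,535 tasks = 392,960
```

Doubling the component count from four to eight multiplies the exhaustive cost by 16, which is why the approximations mentioned elsewhere become necessary for finer-grained workflows.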

When Not To Use

When you cannot afford 2^n evaluations and no approximation is acceptable.

When you only need coarse, one-off comparisons rather than interaction-aware attribution.

Failure Modes

Reflection may appear useless under success-rate metrics even if it improves debugging.

Shapley recommendations depend on the chosen baseline model; a poorly chosen baseline can distort the absolute Shapley values.

Core Entities

Models

Llama3-8B-Instruct, Llama3-70B-Instruct, Claude-3.5-Sonnet, gpt-4-turbo, gpt-4o-mini, qwen2.5-32B, Mistral-8X7B, Mistral-7B, doubao-pro-4k, GLM-4-air

Metrics

task success rate, Accuracy, reward

Datasets

CapaBench; Online Shopping; Navigation Planning; Ticket Ordering; Math (Algebra, Geometry); Automatic Theorem Proving (Coq, Lean4, Isabelle); Operating System (Ubuntu, Git); RobotCoop
