Overview
ShortcutsBench is practical and novel: it uses real, human-created workflows and reveals concrete agent failures (API choice, parameter extraction, ask-for-input). Evidence includes dataset scale and multi-model evaluations reported in the paper.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 65%
Production readiness: 70%
Novelty: 80%
Why It Matters For Business
ShortcutsBench tests end-to-end agent behaviors (API choice, parameter filling, asking for inputs) on real user workflows, revealing practical failure points that matter for automation reliability and cost-effective model selection.
Who Should Care
Summary TLDR
ShortcutsBench is a large public benchmark built from real Apple Shortcuts (88 apps, 1,414 APIs, 7,627 user-driven shortcuts). It provides human-annotated multi-step action sequences, exact parameter values (including enums and previous-action outputs), and checks whether agents ask for missing system/user inputs. Evaluations of 10 leading LLMs show API selection is the main bottleneck, parameter filling is a secondary issue for smaller models, and agents generally fail to ask for required inputs. All data and code are released on GitHub.
Problem Statement
Current tool-use benchmarks are too synthetic or too small to separate modern LLM agents. They use few or hand-crafted APIs, rely on short action sequences, or omit parameter-filling and 'ask-for-input' behaviors. This hides weaknesses of stronger LLMs and fails to evaluate real end-to-end behavior for API-based agents.
Main Contribution
Built ShortcutsBench: 7,627 real shortcuts from Apple Shortcuts, covering 88 apps and 1,414 real APIs with human-annotated action sequences and precise parameter values.
Defined three focused evaluation tasks: API selection, API parameter value filling (primitives, enums, prior-action outputs), and recognition of required system/user input.
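The three tasks can be pictured with a small data model. The following is a minimal sketch, not the benchmark's actual schema or scoring code: the class names, field names, and the rule that parameters only count when the API matches are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class ValueKind(Enum):
    PRIMITIVE = "primitive"        # literal string/number taken from the query
    ENUM = "enum"                  # one of a fixed set of allowed options
    PRIOR_OUTPUT = "prior_output"  # output of an earlier action in the sequence
    NEEDS_INPUT = "needs_input"    # must be requested from the system or user


@dataclass
class ParamValue:
    kind: ValueKind
    value: Any = None


@dataclass
class Action:
    api: str  # hypothetical API identifier
    params: dict[str, ParamValue] = field(default_factory=dict)


def score_step(gold: Action, pred: Action) -> dict[str, bool]:
    """Score one predicted action against the annotated one on all three tasks."""
    api_ok = gold.api == pred.api
    # Parameter filling is judged only on parameters with a concrete gold value.
    fill_names = [n for n, p in gold.params.items() if p.kind is not ValueKind.NEEDS_INPUT]
    params_ok = api_ok and all(
        n in pred.params and pred.params[n] == gold.params[n] for n in fill_names
    )
    # Input recognition: every gold NEEDS_INPUT parameter must be flagged as such.
    ask_names = [n for n, p in gold.params.items() if p.kind is ValueKind.NEEDS_INPUT]
    asks_ok = api_ok and all(
        n in pred.params and pred.params[n].kind is ValueKind.NEEDS_INPUT for n in ask_names
    )
    return {
        "api_selection": api_ok,
        "parameter_filling": params_ok,
        "input_recognition": asks_ok,
    }
```

Keeping the three scores separate makes it possible to see whether an agent fails before parameter filling even starts, which matches the ordering of bottlenecks the summary reports.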
Key Findings
ShortcutsBench scale and realism: 88 apps, 1,414 APIs, 7,627 shortcuts.
API selection accuracy falls sharply with task difficulty.
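One way to check the difficulty trend on your own agent is to bucket per-step API predictions by shortcut length. This is a rough sketch under stated assumptions: the bucket thresholds and the record layout are hypothetical, and the paper's own difficulty levels may be defined differently.

```python
from collections import defaultdict


def bucket(num_actions: int) -> str:
    # Hypothetical thresholds; not taken from the paper.
    if num_actions <= 5:
        return "low"
    if num_actions <= 15:
        return "mid"
    return "high"


def accuracy_by_difficulty(records):
    """records: iterable of (num_actions_in_shortcut, gold_api, predicted_api)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for num_actions, gold_api, pred_api in records:
        level = bucket(num_actions)
        totals[level] += 1
        hits[level] += int(gold_api == pred_api)
    # Only buckets that actually contain records appear in the result.
    return {level: hits[level] / totals[level] for level in totals}
```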
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset scale | 88 apps; 1,414 APIs; 7,627 shortcuts | — | — | ShortcutsBench | Table 2; Sec.3.1 | Table 2 |
| Avg actions per shortcut | 8.34 (evaluation set); 21.62 (full dataset) | — | — | Evaluation set; full dataset | Table 3; Table 2 | Table 3; Table 2 |
What To Try In 7 Days
Run ShortcutsBench on your agent to see if it fails API selection before tuning parameter-fill modules.
Add an explicit 'missing-input detector' that scans parameters for Ask/Clipboard/ExtensionInput/CurrentDate and prompts the user.
If you use open-source models ≥70B, validate on mid/high complexity workflows—they may match closed models on simple tasks but fail on complex ones.
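The 'missing-input detector' suggested above can start as a recursive scan over an action's parameters. This is a minimal sketch: the four marker names come from the text, but representing them as plain string values inside nested dicts and lists is an assumption, and the function name is hypothetical.

```python
from typing import Any

# Marker names taken from the text above; how they are encoded is an assumption.
INPUT_MARKERS = {"Ask", "Clipboard", "ExtensionInput", "CurrentDate"}


def find_missing_inputs(params: Any, path: str = "") -> list[tuple[str, str]]:
    """Return (parameter_path, marker) pairs for values that need system/user input."""
    found = []
    if isinstance(params, dict):
        for key, value in params.items():
            found.extend(find_missing_inputs(value, f"{path}.{key}" if path else key))
    elif isinstance(params, list):
        for index, value in enumerate(params):
            found.extend(find_missing_inputs(value, f"{path}[{index}]"))
    elif isinstance(params, str) and params in INPUT_MARKERS:
        found.append((path, params))
    return found
```

An agent wrapper could call this before executing an action and prompt the user, or query the system, for each returned path instead of guessing a value.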
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
Risks & Boundaries
Limitations
Dataset is Apple Shortcuts–centric; apps/APIs reflect Apple ecosystem only.
The evaluation excludes shortcuts longer than 30 actions and does not cover runworkflow calls.
When Not To Use
You need benchmarks for Android or non-Apple ecosystems.
You require end-to-end final-result validation involving complex binary formats.
Failure Modes
Agent chooses the wrong API even when the correct API is in context (API selection failure).
Agent fails to extract explicit parameters from natural-language queries (parameter extraction error).

