ShortcutsBench: a realistic Apple Shortcuts dataset to stress-test API-based agents

June 28, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

ShortcutsBench is practical and novel: it uses real, human-created workflows and reveals concrete agent failures (API choice, parameter extraction, ask-for-input). Evidence includes dataset scale and multi-model evaluations reported in the paper.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 80%

Authors

Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ShortcutsBench tests end-to-end agent behaviors (API choice, parameter filling, asking for inputs) on real user workflows, revealing practical failure points that matter for automation reliability and cost-effective model selection.

Who Should Care

Summary TLDR

ShortcutsBench is a large public benchmark built from real Apple Shortcuts (88 apps, 1,414 APIs, 7,627 user-driven shortcuts). It provides human-annotated multi-step action sequences, exact parameter values (including enums and previous-action outputs), and checks whether agents ask for missing system/user inputs. Evaluations of 10 leading LLMs show API selection is the main bottleneck, parameter filling is a secondary issue for smaller models, and agents generally fail to ask for required inputs. All data and code are released on GitHub.

Problem Statement

Current tool-use benchmarks are too synthetic or too small to separate modern LLM agents. They rely on few or hand-crafted APIs, use short action sequences, or omit parameter-filling and 'ask-for-input' behaviors. This hides the weaknesses of stronger LLMs and fails to evaluate real end-to-end behavior for API-based agents.

Main Contribution

Built ShortcutsBench: 7,627 real shortcuts from Apple Shortcuts, covering 88 apps and 1,414 real APIs with human-annotated action sequences and precise parameter values.

Defined three focused evaluation tasks: API selection, API parameter value filling (primitives, enums, prior-action outputs), and recognition of required system/user input.
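The three tasks can be pictured as three scoring functions over aligned gold and predicted action sequences. The following is a minimal sketch, not the paper's evaluation code: the `Action` record, the exact-match scoring, the plain-string representation of input markers, and the API names in the usage example are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumption: missing-input placeholders appear as plain strings.
ASK_MARKERS = frozenset({"Ask", "Clipboard", "ExtensionInput", "CurrentDate"})


@dataclass
class Action:
    """One step of a shortcut: an API identifier plus its parameter values."""
    api: str
    params: dict = field(default_factory=dict)


def api_selection_accuracy(gold, predicted):
    """Fraction of gold steps whose API name was predicted exactly."""
    if not gold:
        return 0.0
    hits = sum(g.api == p.api for g, p in zip(gold, predicted))
    return hits / len(gold)  # steps missing from `predicted` count as wrong


def param_filling_accuracy(gold, predicted):
    """Fraction of gold parameters reproduced exactly, on correctly chosen APIs only."""
    total = correct = 0
    for g, p in zip(gold, predicted):
        if g.api != p.api:
            continue
        for name, value in g.params.items():
            total += 1
            correct += p.params.get(name) == value
    return correct / total if total else 0.0


def input_recognition_accuracy(gold, predicted):
    """Fraction of gold 'ask the user/system' parameters the agent also marked."""
    total = correct = 0
    for g, p in zip(gold, predicted):
        for name, value in g.params.items():
            if isinstance(value, str) and value in ASK_MARKERS:
                total += 1
                correct += p.params.get(name) == value
    return correct / total if total else 0.0
```

With a hypothetical two-step shortcut where the agent picks both APIs correctly but invents a message text instead of asking for it, the three scores come apart:

```python
gold = [Action("get_clipboard"), Action("send_message", {"text": "Ask", "recipient": "Bob"})]
pred = [Action("get_clipboard"), Action("send_message", {"text": "hello", "recipient": "Bob"})]
print(api_selection_accuracy(gold, pred))      # 1.0
print(param_filling_accuracy(gold, pred))      # 0.5
print(input_recognition_accuracy(gold, pred))  # 0.0
```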

Key Findings

ShortcutsBench scale and realism: 88 apps, 1,414 APIs, 7,627 shortcuts.

Numbers: 88 apps; 1,414 APIs; 7,627 shortcuts (Table 2, Sec. 3.1)

Practical Use: Use ShortcutsBench to stress-test agents on realistic, multi-step API workflows rather than toy or hand-crafted toolsets.

Evidence Ref: Table 2; Sec. 3.1

API selection accuracy falls sharply with task difficulty.

Numbers: Average accuracy dropped 19% from difficulty (0,1] to (1,5] and 46% from (0,1] to (5,15] (Sec. 4.2)

Practical Use: Prioritize improving agent planning and API-choice logic before focusing on fine-grained parameter handling for complex tasks.

Evidence Ref: Sec. 4.2; Figure 3
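To reproduce this kind of breakdown on your own agent, per-example results can be grouped into the same half-open difficulty intervals. The sketch below assumes each evaluation record is a `(difficulty, is_correct)` pair with a numeric difficulty score; that record shape and the helper names are hypothetical. The summary does not say whether the reported drops are absolute or relative, so the sketch reports the absolute difference only.

```python
from collections import defaultdict

# Half-open intervals (low, high], matching the ranges quoted in the finding.
BUCKETS = [(0, 1), (1, 5), (5, 15)]


def bucket_of(difficulty):
    """Return the (low, high] interval containing `difficulty`, or None."""
    for low, high in BUCKETS:
        if low < difficulty <= high:
            return (low, high)
    return None


def accuracy_by_bucket(records):
    """records: iterable of (difficulty, is_correct) pairs (hypothetical shape)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for difficulty, is_correct in records:
        bucket = bucket_of(difficulty)
        if bucket is None:
            continue  # outside the reported ranges
        totals[bucket] += 1
        hits[bucket] += bool(is_correct)
    return {b: hits[b] / totals[b] for b in BUCKETS if totals[b]}


def absolute_drop(acc_easy, acc_hard):
    """Difference in accuracy between an easier and a harder bucket."""
    return acc_easy - acc_hard
```

A large gap between the first and last bucket on your own data would point at the same API-selection weakness on longer workflows.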

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset scale | 88 apps; 1,414 APIs; 7,627 shortcuts | - | - | ShortcutsBench | Table 2; Sec. 3.1 | Table 2
Avg actions per shortcut (evaluation set) | 8.34 actions (overall); 21.62 in full dataset average | - | - | Table 3; Table 2 | Table 3; Table 2 | Table 3; Table 2

What To Try In 7 Days

Run ShortcutsBench on your agent to see if it fails API selection before tuning parameter-fill modules.

Add an explicit 'missing-input detector' that scans parameters for Ask/Clipboard/ExtensionInput/CurrentDate and prompts the user.

If you use open-source models of 70B parameters or more, validate them on mid- and high-complexity workflows: they may match closed models on simple tasks but fail on complex ones.
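The missing-input detector from the list above could start as a recursive scan over an action's parameters. This is a minimal sketch under a stated assumption: that a parameter needing user or system input is encoded as a nested dictionary whose `"Type"` field holds one of the four markers. The parameter names in the usage example are made up, and real shortcut files may encode these markers differently.

```python
INPUT_MARKERS = {"Ask", "Clipboard", "ExtensionInput", "CurrentDate"}


def find_missing_inputs(params, path=""):
    """Return (path, marker) pairs for every parameter that needs external input.

    Assumption: a marker is a dict entry {"Type": <marker>} somewhere in the
    nested parameter structure.
    """
    found = []
    if isinstance(params, dict):
        marker = params.get("Type")
        if isinstance(marker, str) and marker in INPUT_MARKERS:
            found.append((path or "<root>", marker))
        for key, value in params.items():
            child = f"{path}.{key}" if path else str(key)
            found.extend(find_missing_inputs(value, child))
    elif isinstance(params, list):
        for index, value in enumerate(params):
            found.extend(find_missing_inputs(value, f"{path}[{index}]"))
    return found


def prompts_for(params):
    """Turn each detected marker into a question for the user."""
    return [
        f"Please provide a value for '{where}' (expected source: {marker})."
        for where, marker in find_missing_inputs(params)
    ]
```

Usage with a hypothetical action:

```python
params = {
    "message": {"Value": {"Type": "Ask"}},
    "attachments": [{"Type": "Clipboard"}, "plain text"],
    "count": 3,
}
print(find_missing_inputs(params))
# [('message.Value', 'Ask'), ('attachments[0]', 'Clipboard')]
```

Running the scan before execution lets the agent pause and ask rather than guess a value.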

Agent Features

Memory
short-term history passed as context (no external retrieval in eval)
Planning
next-action prediction (stepwise API selection); multi-step workflow planning
Tool Use
API function calling (AppIntents/SiriKit); parameter filling including enums and outputs from prior actions
Frameworks
ReAct-style prompting; Apple Shortcuts API definitions (.actionsdata, .intentdefinition, WFActions.json)
Is Agentic

Yes

Architectures
LLM-based agent (prompt + tool calls)

Optimization Features

Token Efficiency
context-length management by limiting APIs in context (x×|APIs_i| clipping)
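One plausible reading of the x×|APIs_i| clipping is that, for shortcut i, the prompt contains the APIs the shortcut actually uses plus enough distractors to reach x times that count. The sketch below implements that reading only; the sampling strategy, the default `x`, and the function name are assumptions, not details confirmed by this summary.

```python
import random


def build_api_context(gold_apis, all_apis, x=5, seed=0):
    """Pick at most x * |gold APIs| candidate APIs to place in the prompt.

    Assumption: the gold APIs are always kept, and the remaining budget is
    filled with distractors sampled uniformly from the rest of the catalog.
    """
    gold = list(dict.fromkeys(gold_apis))  # de-duplicate, keep first-seen order
    gold_set = set(gold)
    pool = [api for api in all_apis if api not in gold_set]

    budget = x * len(gold)
    n_distractors = min(max(budget - len(gold), 0), len(pool))

    rng = random.Random(seed)  # seeded so the context is reproducible
    context = gold + rng.sample(pool, n_distractors)
    rng.shuffle(context)  # avoid leaking the answer through ordering
    return context
```

Raising `x` makes API selection harder and the prompt longer, so it acts as a single knob trading token cost against test difficulty.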

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Dataset is Apple Shortcuts–centric; apps/APIs reflect Apple ecosystem only.

Evaluation excludes shortcuts longer than 30 actions and use of runworkflow calls.

When Not To Use

You need benchmarks for Android or non-Apple ecosystems.

You require end-to-end final-result validation involving complex binary formats.

Failure Modes

Agent chooses the wrong API even when the correct API is in context (API selection failure).

Agent fails to extract explicitly stated parameters from natural-language queries (parameter extraction error).

Core Entities

Models

Gemini-1.5-Pro, Gemini-1.5-Flash, QWen-2-72B, QWen-2-57B, LLaMA-3-70B, Deepseek-2-chat, Deepseek-2-coder, GPT-4o-mini, GPT-3.5-turbo, ChatGLM-4-Air, AgentLM, xLAM, Lemur-70B-Chat-V1

Metrics

Accuracy

Datasets

ShortcutsBench

Benchmarks

MetaTool, ToolLLM, ToolBench, API-Bench, ToolAlpaca, API-Bank, ToolQA, ToolLens