ShortcutsBench: a realistic Apple Shortcuts dataset to stress-test API-based agents

Overview

Decision Snapshot: Needs Validation

ShortcutsBench is practical and novel: it uses real, human-created workflows and reveals concrete agent failures (API choice, parameter extraction, ask-for-input). Evidence includes dataset scale and multi-model evaluations reported in the paper.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 80%

Authors

Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ShortcutsBench tests end-to-end agent behaviors (API choice, parameter filling, asking for inputs) on real user workflows, revealing practical failure points that matter for automation reliability and cost-effective model selection.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

ShortcutsBench is a large public benchmark built from real Apple Shortcuts (88 apps, 1,414 APIs, 7,627 user-driven shortcuts). It provides human-annotated multi-step action sequences, exact parameter values (including enums and previous-action outputs), and checks whether agents ask for missing system/user inputs. Evaluations of 10 leading LLMs show API selection is the main bottleneck, parameter filling is a secondary issue for smaller models, and agents generally fail to ask for required inputs. All data and code are released on GitHub.

Problem Statement

Current tool-use benchmarks are too synthetic or too small to separate modern LLM agents. They rely on few or hand-crafted APIs, use short action sequences, or omit parameter-filling and 'ask-for-input' behaviors. This hides the weaknesses of stronger LLMs and fails to evaluate the real end-to-end behavior of API-based agents.

Main Contribution

Built ShortcutsBench: 7,627 real shortcuts from Apple Shortcuts, covering 88 apps and 1,414 real APIs with human-annotated action sequences and precise parameter values.

Defined three focused evaluation tasks: API selection, API parameter value filling (primitives, enums, prior-action outputs), and recognition of required system/user input.

Key Findings

ShortcutsBench scale and realism: 88 apps, 1,414 APIs, 7,627 shortcuts.

Numbers: 88 apps; 1,414 APIs; 7,627 shortcuts (Table 2, Sec. 3.1)

Practical Use: Use ShortcutsBench to stress-test agents on realistic, multi-step API workflows rather than toy or hand-crafted toolsets.

Evidence Ref: Table 2; Sec. 3.1

API selection accuracy falls sharply with task difficulty.

Numbers: Average accuracy dropped 19% from difficulty (0,1] to (1,5] and 46% from (0,1] to (5,15] (Sec. 4.2)

Practical Use: Prioritize improving agent planning and API-choice logic before focusing on fine-grained parameter handling for complex tasks.

Evidence Ref: Sec. 4.2; Figure 3
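The per-difficulty comparison above can be reproduced on your own agent logs with a small helper. This is a minimal sketch, not the paper's evaluation code: the record format (a difficulty score paired with a correctness flag), the function names, and the choice to report the drop as an absolute difference are all assumptions made for illustration.

```python
from collections import defaultdict

# Buckets are half-open intervals (low, high], mirroring the
# (0,1], (1,5], (5,15] ranges quoted in the finding.
BUCKETS = [(0, 1), (1, 5), (5, 15)]


def bucket_of(difficulty):
    """Return the (low, high] bucket containing difficulty, or None."""
    for low, high in BUCKETS:
        if low < difficulty <= high:
            return (low, high)
    return None


def accuracy_by_bucket(records):
    """records: iterable of (difficulty, is_correct) pairs (hypothetical format)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for difficulty, is_correct in records:
        bucket = bucket_of(difficulty)
        if bucket is None:
            continue  # outside the ranges considered here
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in BUCKETS if totals[b]}


def drop_from_easiest(acc):
    """Absolute accuracy drop of each bucket relative to the first bucket."""
    base = acc[BUCKETS[0]]
    return {b: base - a for b, a in acc.items() if b != BUCKETS[0]}
```

A large drop between the first and later buckets on your own data would point to the same API-selection weakness the finding describes.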

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset scale	88 apps; 1,414 APIs; 7,627 shortcuts	n/a	n/a	ShortcutsBench	Table 2; Sec. 3.1	Table 2
Avg actions per shortcut	8.34 (evaluation set); 21.62 (full dataset)	n/a	n/a	ShortcutsBench	Table 3; Table 2	Table 3; Table 2

What To Try In 7 Days

Run ShortcutsBench on your agent to see if it fails API selection before tuning parameter-fill modules.

Add an explicit 'missing-input detector' that scans parameters for Ask/Clipboard/ExtensionInput/CurrentDate and prompts the user.

If you use open-source models ≥70B, validate them on mid- and high-complexity workflows: they may match closed models on simple tasks but fail on complex ones.
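The 'missing-input detector' suggested above can start as a recursive scan over an action's parameters. The sketch below assumes a hypothetical representation in which a placeholder is a dictionary carrying a "Type" key set to one of the four names listed above; your own parameter format may differ, and the function name is illustrative only.

```python
# Placeholder types named in the text; the dict-with-"Type" encoding is an assumption.
MISSING_INPUT_TYPES = {"Ask", "Clipboard", "ExtensionInput", "CurrentDate"}


def find_missing_inputs(value, path=""):
    """Recursively collect (path, type) for every placeholder found in value."""
    found = []
    if isinstance(value, dict):
        marker = value.get("Type")
        if isinstance(marker, str) and marker in MISSING_INPUT_TYPES:
            found.append((path or "<root>", marker))
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            found.extend(find_missing_inputs(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(find_missing_inputs(child, f"{path}[{index}]"))
    return found
```

Each returned pair names where the placeholder sits and which kind of input is needed, so the agent can ask the user (or read the system value) before calling the API.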

Agent Features

Memory

short-term history passed as context (no external retrieval in eval)

Planning

next-action prediction (stepwise API selection)

multi-step workflow planning
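Stepwise next-action prediction can be scored per shortcut as follows. This is a sketch under the assumption that the agent is always shown the annotated (gold) history rather than its own earlier predictions; the summary does not spell out the exact protocol, and `predict_next` is a hypothetical stand-in for your agent.

```python
def stepwise_api_accuracy(gold_actions, predict_next):
    """Next-action accuracy for one shortcut, feeding the gold prefix at each step.

    gold_actions: list of API identifiers in the annotated order.
    predict_next: callable taking the gold history (a list) and returning
                  the predicted identifier for the next action.
    """
    if not gold_actions:
        return 0.0
    correct = 0
    for step, gold in enumerate(gold_actions):
        history = gold_actions[:step]  # gold prefix, not the agent's past predictions
        if predict_next(history) == gold:
            correct += 1
    return correct / len(gold_actions)
```

Feeding the gold prefix keeps one early mistake from cascading through the rest of the sequence, which makes the per-step numbers easier to compare across models.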

Tool Use

API function calling (AppIntents/SiriKit)

parameter filling including enums and outputs from prior actions

Frameworks

ReAct-style prompting

Apple Shortcuts API definitions (.actionsdata, .intentdefinition, WFActions.json)

Is Agentic

Yes

Architectures

LLM-based agent (prompt + tool calls)

Optimization Features

Token Efficiency

context-length management by limiting APIs in context (x×|APIs_i| clipping)
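One way to read the x×|APIs_i| clipping is as a cap on how many candidate APIs are shown for shortcut i. The sketch below assumes the APIs the shortcut actually uses are always kept and the remaining slots are filled with randomly sampled distractors; that sampling strategy, the function name, and its parameters are illustrative assumptions, not details confirmed by the summary.

```python
import random


def clip_api_context(gold_apis, api_pool, x, seed=0):
    """Build a candidate API list of size at most x * len(unique gold_apis).

    gold_apis: APIs the shortcut actually uses (APIs_i in the text).
    api_pool:  all available API identifiers.
    x:         integer multiplier controlling how many candidates are shown.
    """
    gold = list(dict.fromkeys(gold_apis))  # de-duplicate, keep order
    budget = max(x * len(gold), len(gold))  # never drop a gold API
    gold_set = set(gold)
    distractors = [api for api in api_pool if api not in gold_set]
    rng = random.Random(seed)  # fixed seed keeps the context reproducible
    extra = rng.sample(distractors, min(budget - len(gold), len(distractors)))
    candidates = gold + extra
    rng.shuffle(candidates)  # avoid leaking the answer through position
    return candidates
```

Raising x makes API selection harder and the prompt longer, so it is worth sweeping a few values to see where your model's accuracy starts to fall.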

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/EachSheep/ShortcutsBench

Data URLs

https://github.com/EachSheep/ShortcutsBench

Risks & Boundaries

Limitations

Dataset is Apple Shortcuts–centric; apps/APIs reflect Apple ecosystem only.

Evaluation excludes shortcuts longer than 30 actions and use of runworkflow calls.

When Not To Use

You need benchmarks for Android or non-Apple ecosystems.

You require end-to-end final-result validation involving complex binary formats.

Failure Modes

Agent chooses the wrong API even when the correct API is in context (API selection failure).

Agent fails to extract explicit parameters from natural queries (parameter extraction error).

Core Entities

Models

Gemini-1.5-Pro, Gemini-1.5-Flash, QWen-2-72B, QWen-2-57B, LLaMA-3-70B, Deepseek-2-chat, Deepseek-2-coder, GPT-4o-mini, GPT-3.5-turbo, ChatGLM-4-Air, AgentLM, xLAM, Lemur-70B-Chat-V1

Metrics

Accuracy

Datasets

ShortcutsBench

Benchmarks

MetaTool, ToolLLM, ToolBench, API-Bench, ToolAlpaca, API-Bank, ToolQA, ToolLens
