ShortcutsBench: a realistic Apple Shortcuts dataset to stress-test API-based agents

June 28, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.8

Cost Impact Score

0.65

Citation Count

1

Authors

Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

Links

Abstract / PDF

Why It Matters For Business

ShortcutsBench tests end-to-end agent behaviors (API choice, parameter filling, asking for inputs) on real user workflows, revealing practical failure points that matter for automation reliability and cost-effective model selection.

Summary TLDR

ShortcutsBench is a large public benchmark built from real Apple Shortcuts (88 apps, 1,414 APIs, 7,627 user-driven shortcuts). It provides human-annotated multi-step action sequences, exact parameter values (including enums and previous-action outputs), and checks whether agents ask for missing system/user inputs. Evaluations of 10 leading LLMs show API selection is the main bottleneck, parameter filling is a secondary issue for smaller models, and agents generally fail to ask for required inputs. All data and code are released on GitHub.

Problem Statement

Current tool-use benchmarks are too synthetic or too small to separate modern LLM agents. They use few or hand-crafted APIs, rely on short action sequences, or omit parameter-filling and 'ask-for-input' behaviors. This hides the weaknesses of stronger LLMs and fails to evaluate real end-to-end behavior for API-based agents.

Main Contribution

Built ShortcutsBench: 7,627 real shortcuts from Apple Shortcuts, covering 88 apps and 1,414 real APIs with human-annotated action sequences and precise parameter values.

Defined three focused evaluation tasks: API selection, API parameter value filling (primitives, enums, prior-action outputs), and recognition of required system/user input.

Evaluated 10 leading LLMs (5 open-source, 5 closed-source) and additional fine-tuned agent models, revealing practical failure modes and releasing all code, data, and logs.

Key Findings

ShortcutsBench scale and realism: 88 apps, 1,414 APIs, 7,627 shortcuts.

Numbers: 88 apps; 1,414 APIs; 7,627 shortcuts (Table 2, Sec. 3.1)

API selection accuracy falls sharply with task difficulty.

Numbers: Average accuracy dropped 19% from difficulty (0,1] to (1,5] and 46% from (0,1] to (5,15] (Sec. 4.2)
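The drop figures above can be recomputed from per-sample results with a small bucketing routine. The sketch below assumes each evaluated sample carries a numeric difficulty value and a correct/incorrect flag. The bucket edges are the ranges quoted here; the function names, the made-up sample data, and the reading of the drop as a relative change are illustrative assumptions rather than the authors' code.

```python
from collections import defaultdict

# Bucket edges taken from the ranges quoted above: (0,1], (1,5], (5,15].
# The sample format (difficulty, is_correct) is a hypothetical choice.
BUCKETS = [(0, 1), (1, 5), (5, 15)]


def bucket_of(difficulty):
    """Return the (low, high] bucket containing difficulty, or None."""
    for low, high in BUCKETS:
        if low < difficulty <= high:
            return (low, high)
    return None


def accuracy_by_bucket(samples):
    """Compute accuracy per bucket from (difficulty, is_correct) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for difficulty, is_correct in samples:
        bucket = bucket_of(difficulty)
        if bucket is None:
            continue  # outside the ranges considered in this sketch
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in totals}


def relative_drop(acc, base, other):
    """Relative accuracy drop from bucket `base` to bucket `other`."""
    return (acc[base] - acc[other]) / acc[base]


# Made-up results, purely for illustration.
samples = [
    (1, True), (1, True),
    (3, True), (4, False),
    (10, False), (12, True), (8, False), (7, False),
]
acc = accuracy_by_bucket(samples)
print(relative_drop(acc, (0, 1), (1, 5)))
print(relative_drop(acc, (0, 1), (5, 15)))
```

Whether a reported drop is relative or in absolute percentage points changes the interpretation, so it is worth checking against the paper before comparing numbers.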

Agents frequently fail to ask for required system/user inputs.

Numbers: Recognition accuracy ranges from ≈30.6% to 55.2% across models (Table 4)

Open-source large models match closed-source on simple tasks but lag on hard ones.

Numbers: Open-source models ≥70B match closed-source models on the first 3 difficulty levels but underperform at the highest complexity (Sec. 4.2)

Parameter extraction from user queries is harder than reusing previous-action outputs.

Numbers: Parameter-fill errors concentrated on extracting primitives/enums vs. smaller drops when using prior outputs (Sec.4.2; 6

Results

Dataset scale

Value: 88 apps; 1,414 APIs; 7,627 shortcuts

Avg actions per shortcut (evaluation set)

Value: 8.34 actions (evaluation set overall); 21.62 actions on average in the full dataset

API selection accuracy (change with difficulty)

Value: −19% from (0,1] to (1,5]; −46% from (0,1] to (5,15]

Recognition of need for input (range across models)

Value: 30.55% to 55.18% accuracy

Who Should Care

What To Try In 7 Days

Run ShortcutsBench on your agent to see if it fails API selection before tuning parameter-fill modules.

Add an explicit 'missing-input detector' that scans parameters for Ask/Clipboard/ExtensionInput/CurrentDate and prompts the user.

If you use open-source models ≥70B, validate on mid/high complexity workflows: they may match closed models on simple tasks but fail on complex ones.
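A missing-input detector of the kind suggested above could start as a recursive scan over an action's parameters. The sketch below is a minimal illustration: the marker names (Ask, Clipboard, ExtensionInput, CurrentDate) come from the text, but the action layout (a dict with "api" and "parameters" keys) and the idea that markers appear as plain strings are assumptions. Real shortcut files encode these placeholders differently.

```python
# Marker names come from the text above; how they appear inside a
# parameter value is an assumption made for this sketch.
INPUT_MARKERS = {"Ask", "Clipboard", "ExtensionInput", "CurrentDate"}


def find_missing_inputs(action):
    """Return sorted (parameter_name, marker) pairs that need external input.

    `action` is a hypothetical dict: {"api": str, "parameters": {name: value}}.
    """
    found = set()

    def scan(name, value):
        # Walk nested dicts and lists, recording any marker string found.
        if isinstance(value, str):
            if value in INPUT_MARKERS:
                found.add((name, value))
        elif isinstance(value, dict):
            for child in value.values():
                scan(name, child)
        elif isinstance(value, (list, tuple)):
            for child in value:
                scan(name, child)

    for name, value in action.get("parameters", {}).items():
        scan(name, value)
    return sorted(found)


def prompts_for(action):
    """Build one human-readable prompt per missing input."""
    return [
        f"{action['api']}: parameter '{name}' needs input from {marker}"
        for name, marker in find_missing_inputs(action)
    ]


# Hypothetical action, for illustration only.
example_action = {
    "api": "send_message",
    "parameters": {
        "recipient": "Ask",
        "body": {"text": "hello", "attachment": ["Clipboard"]},
        "when": "tomorrow",
    },
}
for line in prompts_for(example_action):
    print(line)
```

Running the detector before each tool call lets the agent stop and ask rather than guess a value.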

Agent Features

Memory

  • short-term history passed as context (no external retrieval in eval)

Planning

  • next-action prediction (stepwise API selection)
  • multi-step workflow planning
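Next-action prediction can be scored step by step, as in the minimal sketch below. It assumes that at each step the predictor sees the query plus the ground-truth APIs of all earlier steps (teacher forcing); that setup, the function names, and the toy predictor standing in for an LLM call are assumptions for illustration and not the benchmark's actual harness.

```python
def stepwise_api_accuracy(query, gold_apis, predict_next_api):
    """Score next-action prediction over one workflow.

    At step k the predictor sees the query and the gold APIs of steps < k
    (teacher forcing, an assumption of this sketch) and must name step k's API.
    """
    if not gold_apis:
        return 0.0
    correct = 0
    for step, gold in enumerate(gold_apis):
        history = gold_apis[:step]
        if predict_next_api(query, history) == gold:
            correct += 1
    return correct / len(gold_apis)


def toy_predictor(query, history):
    """Stand-in for an LLM call: follows a fixed, hypothetical plan."""
    plan = ["get_clipboard", "translate_text", "show_result"]
    return plan[len(history)] if len(history) < len(plan) else "stop"


# Hypothetical workflow: the toy predictor gets the last step wrong.
gold = ["get_clipboard", "translate_text", "send_message"]
score = stepwise_api_accuracy("Translate my clipboard and send it", gold, toy_predictor)
print(round(score, 3))
```

Feeding gold history keeps one early mistake from cascading into every later step, which makes per-step accuracy easier to interpret.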

Tool Use

  • API function calling (AppIntents/SiriKit)
  • parameter filling including enums and outputs from prior actions
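Parameter filling can be scored separately for values extracted from the query (primitives and enums) and values that reuse a prior action's output. The sketch below assumes a hypothetical representation in which a reference to an earlier step is written as {"output_of": step_index} and everything else is a literal; the exact-match rule is likewise an assumption, not the paper's scoring code.

```python
def kind_of(value):
    """Classify a gold parameter value (the representation is hypothetical)."""
    if isinstance(value, dict) and "output_of" in value:
        return "prior_output"
    return "literal"  # primitives and enum members


def parameter_accuracy(gold_params, predicted_params):
    """Return {kind: accuracy}; a parameter counts only on an exact match."""
    totals, hits = {}, {}
    for name, gold in gold_params.items():
        kind = kind_of(gold)
        totals[kind] = totals.get(kind, 0) + 1
        if name in predicted_params and predicted_params[name] == gold:
            hits[kind] = hits.get(kind, 0) + 1
    return {kind: hits.get(kind, 0) / totals[kind] for kind in totals}


# Hypothetical gold and predicted parameters for one action.
gold_params = {"text": {"output_of": 0}, "language": "French", "speed": 1.5}
predicted = {"text": {"output_of": 0}, "language": "Spanish", "speed": 1.5}
result = parameter_accuracy(gold_params, predicted)
print(result)
```

Splitting the score by kind shows whether errors come from reading the query or from wiring outputs between steps.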

Frameworks

  • ReAct-style prompting
  • Apple Shortcuts API definitions (.actionsdata, .intentdefinition, WFActions.json)

Is Agentic

true

Architectures

  • LLM-based agent (prompt + tool calls)

Optimization Features

Token Efficiency

  • context-length management by limiting APIs in context (x×|APIs_i| clipping)
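One way to read the x×|APIs_i| clipping above is that, for shortcut i, the prompt holds a candidate list about x times the size of the set of APIs the shortcut actually uses. The sketch below follows that reading: it keeps the gold APIs and pads with randomly sampled distractors. The uniform sampling, the shuffle, the seed parameter, and the function name are all assumptions for illustration.

```python
import random


def clip_api_context(gold_apis, all_apis, x, seed=0):
    """Build a candidate list of about x * |gold_apis| APIs for the prompt."""
    gold = list(dict.fromkeys(gold_apis))  # de-duplicate, keep order
    gold_set = set(gold)
    budget = max(len(gold), int(x * len(gold)))
    pool = [api for api in all_apis if api not in gold_set]
    rng = random.Random(seed)
    # Pad with distractors, capped by what the pool can supply.
    distractors = rng.sample(pool, min(len(pool), budget - len(gold)))
    candidates = gold + distractors
    rng.shuffle(candidates)  # avoid leaking gold APIs by position
    return candidates


# Hypothetical API pool and gold set.
all_apis = [f"api_{i}" for i in range(20)]
context = clip_api_context(["api_1", "api_3", "api_1"], all_apis, x=5)
print(len(context))
```

A larger x makes API selection harder and the prompt longer, so it acts as a single knob trading context length against task difficulty.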

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dataset is Apple Shortcuts–centric; apps/APIs reflect Apple ecosystem only.
  • Evaluation excludes shortcuts longer than 30 actions and use of runworkflow calls.
  • Complex non-text outputs (PDF, rich text) are not fully serialised for end-to-end result checking.
  • Many queries were generated with GPT-4o which can introduce subtle generation bias.

When Not To Use

  • You need benchmarks for Android or non-Apple ecosystems.
  • You require end-to-end final-result validation involving complex binary formats.
  • Your workflows routinely exceed 30 actions or rely on nested shortcut execution.

Failure Modes

  • Agent chooses the wrong API even when the correct API is in context (API selection failure).
  • Agent fails to extract explicit parameters from natural queries (parameter extraction error).
  • Agent fails to prompt user or system for required inputs marked as Ask/Clipboard/ExtensionInput (missing-input awareness).
  • Benchmark query generation uses LLMs, so some queries may reflect generator bias.

Core Entities

Models

  • Gemini-1.5-Pro
  • Gemini-1.5-Flash
  • QWen-2-72B
  • QWen-2-57B
  • LLaMA-3-70B
  • Deepseek-2-chat
  • Deepseek-2-coder
  • GPT-4o-mini
  • GPT-3.5-turbo
  • ChatGLM-4-Air
  • AgentLM
  • xLAM
  • Lemur-70B-Chat-V1

Metrics

  • Accuracy

Datasets

  • ShortcutsBench

Benchmarks

  • MetaTool
  • ToolLLM
  • ToolBench
  • API-Bench
  • ToolAlpaca
  • API-Bank
  • ToolQA
  • ToolLens