Overview
Production Readiness
0.7
Novelty Score
0.8
Cost Impact Score
0.65
Citation Count
1
Why It Matters For Business
ShortcutsBench tests end-to-end agent behaviors (API choice, parameter filling, asking for inputs) on real user workflows, revealing practical failure points that matter for automation reliability and cost-effective model selection.
Summary TLDR
ShortcutsBench is a large public benchmark built from real Apple Shortcuts (88 apps, 1,414 APIs, 7,627 user-driven shortcuts). It provides human-annotated multi-step action sequences, exact parameter values (including enums and previous-action outputs), and checks whether agents ask for missing system/user inputs. Evaluations of 10 leading LLMs show API selection is the main bottleneck, parameter filling is a secondary issue for smaller models, and agents generally fail to ask for required inputs. All data and code are released on GitHub.
Problem Statement
Current tool-use benchmarks are too synthetic or too small to distinguish between modern LLM agents. They rely on few or hand-crafted APIs, use short action sequences, or omit parameter-filling and 'ask-for-input' behaviors. This hides the weaknesses of stronger LLMs and leaves the real end-to-end behavior of API-based agents unevaluated.
Main Contribution
Built ShortcutsBench: 7,627 real shortcuts from Apple Shortcuts, covering 88 apps and 1,414 real APIs with human-annotated action sequences and precise parameter values.
Defined three focused evaluation tasks: API selection, API parameter value filling (primitives, enums, prior-action outputs), and recognition of required system/user input.
Evaluated 10 leading LLMs (5 open-source, 5 closed-source) and additional fine-tuned agent models, revealing practical failure modes and releasing all code, data, and logs.
Key Findings
ShortcutsBench scale and realism: 88 apps, 1,414 APIs, 7,627 shortcuts.
API selection accuracy falls sharply with task difficulty.
Agents rarely ask for required system/user inputs.
Open-source large models match closed-source on simple tasks but lag on hard ones.
Parameter extraction from user queries is harder than reusing previous-action outputs.
Results
Dataset scale
Avg actions per shortcut (evaluation set)
Accuracy
Recognition of need for input (range across models)
Who Should Care
What To Try In 7 Days
Run ShortcutsBench on your agent to see if it fails API selection before tuning parameter-fill modules.
Add an explicit 'missing-input detector' that scans parameters for Ask/Clipboard/ExtensionInput/CurrentDate and prompts the user.
If you use open-source models ≥70B, validate them on mid- and high-complexity workflows: they may match closed models on simple tasks but lag on complex ones.
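The missing-input detector suggested above can be sketched as follows. This is a minimal illustration, assuming each planned action is a plain dict with a `parameters` mapping and that the placeholders appear as the literal strings `Ask`, `Clipboard`, `ExtensionInput`, and `CurrentDate`. The action structure, the function name `find_missing_inputs`, and the `example.*` API names are hypothetical and are not the benchmark's actual schema.

```python
from typing import Any, Dict, List, Tuple

# Placeholder markers named in this summary; how they are encoded in real
# shortcut files is an assumption made for this sketch.
INPUT_MARKERS = {"Ask", "Clipboard", "ExtensionInput", "CurrentDate"}


def find_missing_inputs(actions: List[Dict[str, Any]]) -> List[Tuple[int, str, str]]:
    """Return (action_index, parameter_name, marker) for every parameter
    whose value is one of the input markers."""
    missing = []
    for index, action in enumerate(actions):
        for name, value in action.get("parameters", {}).items():
            if isinstance(value, str) and value in INPUT_MARKERS:
                missing.append((index, name, value))
    return missing


# Hypothetical two-step workflow (API names are made up for illustration).
workflow = [
    {"api": "example.translate", "parameters": {"text": "Clipboard", "target": "fr"}},
    {"api": "example.send_message", "parameters": {"recipient": "Ask", "body": "done"}},
]

for index, name, marker in find_missing_inputs(workflow):
    print(f"action {index}: parameter '{name}' needs {marker} input")
```

An agent wrapper could call such a check before executing a plan and ask the user or the system for each flagged parameter.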
Agent Features
Memory
- short-term history passed as context (no external retrieval in eval)
Planning
- next-action prediction (stepwise API selection)
- multi-step workflow planning
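Stepwise API selection can be scored by comparing the predicted API name with the annotated one at each step. The sketch below assumes exact-match, per-step accuracy with the gold history supplied as context before each prediction; the function name and the `example.*` identifiers are hypothetical, and the benchmark's own scoring script may differ.

```python
from typing import List


def api_selection_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of steps where the predicted API name equals the gold one.

    Assumes exactly one prediction per gold step.
    """
    if len(predicted) != len(gold):
        raise ValueError("expected one prediction per gold step")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical three-step shortcut and a model's stepwise predictions.
gold_steps = ["example.get_clipboard", "example.translate", "example.send_message"]
model_steps = ["example.get_clipboard", "example.detect_language", "example.send_message"]

print(api_selection_accuracy(model_steps, gold_steps))
```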
Tool Use
- API function calling (AppIntents/SiriKit)
- parameter filling including enums and outputs from prior actions
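The three kinds of parameter values mentioned above (primitives, enums, and outputs of prior actions) call for different validity checks. The following is a rough sketch under stated assumptions: the `spec` dictionary layout, its `kind`, `choices`, and `type` fields, and the output names are invented for illustration and do not come from the benchmark.

```python
from typing import Any, Dict, List, Optional


def check_parameter(
    value: Any,
    spec: Dict[str, Any],
    prior_outputs: List[str],
) -> Optional[str]:
    """Return an error message if `value` is invalid for `spec`, else None.

    `spec["kind"]` is one of "primitive", "enum", or "prior_output"; the
    three-way split mirrors the summary, but the field names are invented.
    """
    kind = spec["kind"]
    if kind == "enum":
        if value not in spec["choices"]:
            return f"{value!r} is not one of {spec['choices']}"
    elif kind == "prior_output":
        if value not in prior_outputs:
            return f"{value!r} does not name an earlier action's output"
    elif kind == "primitive":
        if not isinstance(value, spec["type"]):
            return f"{value!r} is not of type {spec['type'].__name__}"
    else:
        return f"unknown parameter kind {kind!r}"
    return None


# Outputs produced by earlier actions, referenced by hypothetical names.
outputs_so_far = ["clipboard_text", "translated_text"]

print(check_parameter("fr", {"kind": "enum", "choices": ["en", "fr", "de"]}, outputs_so_far))
print(check_parameter("summary_text", {"kind": "prior_output"}, outputs_so_far))
```

The first call passes the enum check; the second is rejected because no earlier action produced an output with that name.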
Frameworks
- ReAct-style prompting
- Apple Shortcuts API definitions (.actionsdata, .intentdefinition, WFActions.json)
Is Agentic
true
Architectures
- LLM-based agent (prompt + tool calls)
Optimization Features
Token Efficiency
- context-length management by limiting APIs in context (x×|APIs_i| clipping)
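One way to read the clipping rule above is that the candidate API list placed in the prompt for shortcut i is capped at x times the number of APIs that shortcut actually uses. The sketch below illustrates that reading only. The interpretation, the random sampling of distractors, the function name `clip_api_candidates`, and the `example.*` catalog are all assumptions rather than the authors' documented procedure.

```python
import random
from typing import List


def clip_api_candidates(
    gold_apis: List[str],
    all_apis: List[str],
    x: int,
    seed: int = 0,
) -> List[str]:
    """Build a candidate list of at most x * (number of unique gold APIs).

    All gold APIs are kept; the remaining slots are filled with randomly
    sampled distractors. The sampling strategy is an assumption.
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    unique_gold = list(dict.fromkeys(gold_apis))  # de-duplicate, keep order
    budget = x * len(unique_gold)
    gold_set = set(unique_gold)
    distractors = [api for api in all_apis if api not in gold_set]
    rng = random.Random(seed)
    extra_count = min(budget - len(unique_gold), len(distractors))
    candidates = unique_gold + rng.sample(distractors, extra_count)
    rng.shuffle(candidates)  # avoid leaking gold APIs through their position
    return candidates


# Hypothetical catalog of 20 APIs and a shortcut that uses two distinct ones.
catalog = [f"example.api_{n}" for n in range(20)]
shortcut_apis = ["example.api_3", "example.api_7", "example.api_3"]
context_apis = clip_api_candidates(shortcut_apis, catalog, x=3)
print(len(context_apis))
```

Keeping the gold APIs in the clipped list means a wrong choice reflects a selection error rather than a missing candidate.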
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Dataset is Apple Shortcuts–centric; apps/APIs reflect Apple ecosystem only.
- Evaluation excludes shortcuts longer than 30 actions and use of runworkflow calls.
- Complex non-text outputs (PDF, rich text) are not fully serialised for end-to-end result checking.
- Many queries were generated with GPT-4o, which can introduce subtle generation bias.
When Not To Use
- You need benchmarks for Android or non-Apple ecosystems.
- You require end-to-end final-result validation involving complex binary formats.
- Your workflows routinely exceed 30 actions or rely on nested shortcut execution.
Failure Modes
- Agent chooses the wrong API even when the correct API is in context (API selection failure).
- Agent fails to extract explicit parameters from natural queries (parameter extraction error).
- Agent fails to prompt user or system for required inputs marked as Ask/Clipboard/ExtensionInput (missing-input awareness).
- Benchmark query generation uses LLMs, so some queries may reflect generator bias.
Core Entities
Models
- Gemini-1.5-Pro
- Gemini-1.5-Flash
- QWen-2-72B
- QWen-2-57B
- LLaMA-3-70B
- Deepseek-2-chat
- Deepseek-2-coder
- GPT-4o-mini
- GPT-3.5-turbo
- ChatGLM-4-Air
- AgentLM
- xLAM
- Lemur-70B-Chat-V1
Metrics
- Accuracy
Datasets
- ShortcutsBench
Benchmarks
- MetaTool
- ToolLLM
- ToolBench
- API-Bench
- ToolAlpaca
- API-Bank
- ToolQA
- ToolLens

