Overview
The platform and dataset are a useful prototype for hybrid API+UI evaluation. Experiments across multiple LLMs support the claims, but practical deployment needs richer APIs, robustness fixes, and fine-tuning.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Mobile-Bench helps teams test phone assistants and automation agents across realistic multi-app flows. It shows that APIs speed up tasks but must be selected carefully, so invest in hybrid API+UI support and robust process checks.
Who Should Care
Summary TLDR
Mobile-Bench is a new platform, dataset, and metric for evaluating LLM-driven mobile agents that interact with phones through both UI actions and API calls. The dataset has 832 test queries (SAST 332, SAMT 300, MAMT 200), covers 29 apps and 103 APIs, and focuses on multi-app planning and sequential action checks. The authors introduce CheckPoint, a process-focused metric (package, key phrase, API), and test several LLMs (GPT-3.5, GPT-4, and LLaMA models of different sizes). Key findings: APIs can speed up execution but hurt coverage if misused; planning and an observation->thought->action loop are essential; GPT-4 achieves the highest PassRate on simple tasks (SAST 80.96%) but drops sharply on multi-app tasks (MAMT 26.5%). Code and the
Problem Statement
Existing benchmarks focus either on UI steps or on single apps, are slow to evaluate real mobile flows, and lack process-aware metrics for sequential multi-app tasks. This makes it hard to test whether LLM agents can plan across apps, choose between APIs and UI actions, and follow multi-step mobile workflows.
Main Contribution
A running mobile test platform that supports hybrid UI operations and API calls.
An 832-entry dataset (SAST, SAMT, MAMT) emphasizing multi-app planning and real user queries plus GPT-4–augmented cases.
Key Findings
Mobile-Bench dataset: 832 cases across three difficulty tiers.
Platform app/API coverage: 29 apps and 103 usable APIs collected.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 PassRate (by difficulty) | SAST 80.96%, SAMT 63%, MAMT 26.5% | — | — | Mobile-Bench - Table 4 | Table 4 reports per-tier PassRate for GPT-4 | Table 4 |
| Average #Steps (GPT-4) | SAST 3.79, SAMT 13.94, MAMT 44.86 | — | — | Mobile-Bench - Table 4 | Table 4 Average #Steps for GPT-4 | Table 4 |
What To Try In 7 Days
Run Mobile-Bench on a small set of production tasks to see where APIs save steps and where they fail.
Add a thought/plan step in your agent loop and measure PassRate before/after on simple tasks.
Inspect agent completion logic and add explicit checkpoint checks to avoid premature stopping.
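The last two suggestions can be prototyped together. What follows is a minimal sketch, not the Mobile-Bench implementation: `run_agent`, `Step`, the `policy` callable, and the substring-based checkpoint matching are all hypothetical names and simplifications introduced here. It shows an observation -> thought -> action loop that refuses a `FINISH` action until every required checkpoint has been seen.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

FINISH = "FINISH"  # hypothetical sentinel action meaning "task complete"


@dataclass
class Step:
    observation: str
    thought: str
    action: str


def run_agent(
    observe: Callable[[], str],
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    required_checkpoints: Sequence[str],
    max_steps: int = 10,
) -> Tuple[bool, List[Step]]:
    """Run an observation -> thought -> action loop.

    `policy` returns (thought, action). A FINISH action is accepted only
    once every required checkpoint has appeared in an observation or an
    action; otherwise the loop keeps going until max_steps is exhausted.
    """
    history: List[Step] = []
    reached = set()
    for _ in range(max_steps):
        observation = observe()
        thought, action = policy(observation, history)
        history.append(Step(observation, thought, action))

        # Simplified matching (assumption): a checkpoint counts as reached
        # if its string occurs in the observation or in the action.
        for checkpoint in required_checkpoints:
            if checkpoint in observation or checkpoint in action:
                reached.add(checkpoint)

        # Guard against premature stopping: ignore FINISH until all
        # checkpoints have been reached.
        if action == FINISH and reached == set(required_checkpoints):
            return True, history
    return False, history
```

To measure the effect of the thought step, one option is to run the same task set twice, once with a policy that returns an empty thought and once with a planning policy, and compare how often the first element of the result is `True`.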
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Collaboration
Optimization Features
Token Efficiency
Reproducibility
Risks & Boundaries
Limitations
LLMs hallucinate or mispredict API calls, confusing app functionality.
CheckPoint evaluates process coverage but cannot fully assess final outcome quality.
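To make the process-versus-outcome limitation concrete, here is a rough sketch of a checkpoint-style coverage score. The summary only says that CheckPoint looks at packages, key phrases, and APIs; the exact formula is not given here, so the equal weighting, the `TraceStep` and `CheckPoints` types, and the `checkpoint_coverage` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceStep:
    package: str               # foreground app package at this step
    text: str                  # text the agent typed or read at this step
    api: Optional[str] = None  # API name called at this step, if any


@dataclass
class CheckPoints:
    packages: List[str]
    key_phrases: List[str]
    apis: List[str]


def checkpoint_coverage(trace: List[TraceStep], expected: CheckPoints) -> float:
    """Fraction of expected checkpoints that appear anywhere in the trace.

    Assumption: every checkpoint has equal weight and order is ignored.
    """
    seen_packages = {step.package for step in trace}
    seen_apis = {step.api for step in trace if step.api}
    all_text = " ".join(step.text.lower() for step in trace)

    hits = sum(p in seen_packages for p in expected.packages)
    hits += sum(k.lower() in all_text for k in expected.key_phrases)
    hits += sum(a in seen_apis for a in expected.apis)

    total = len(expected.packages) + len(expected.key_phrases) + len(expected.apis)
    # With nothing to check, coverage is treated as vacuously complete.
    return hits / total if total else 1.0
```

A trace can score high on this kind of measure while still producing a wrong final result, for example by opening the right app and calling the right API with a bad argument, which is why a separate outcome check is needed when final quality matters.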
When Not To Use
When you need an end-to-end quality score of final outputs rather than process checks.
When target apps lack accessible APIs or require payment flows (these were filtered).
Failure Modes
Hallucinated API calls that lead the agent off-track.
Premature termination: agent declares success before task completion.
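One common mitigation for hallucinated API calls is to validate each proposed call against a registry of known APIs before executing it. The sketch below is illustrative only: the registry contents (`set_alarm`, `send_sms`) and the `validate_api_call` helper are hypothetical and are not taken from the Mobile-Bench API set.

```python
from typing import Any, Dict, Mapping, Set, Tuple

# Hypothetical registry: API name -> required argument names.
API_REGISTRY: Dict[str, Set[str]] = {
    "set_alarm": {"hour", "minute"},
    "send_sms": {"recipient", "body"},
}


def validate_api_call(
    name: str,
    args: Mapping[str, Any],
    registry: Mapping[str, Set[str]] = API_REGISTRY,
) -> Tuple[bool, str]:
    """Check a model-proposed API call before it is executed.

    Returns (is_valid, reason). Unknown API names and mismatched argument
    sets are rejected so the agent can re-plan or fall back to UI actions.
    """
    if name not in registry:
        return False, f"unknown API: {name}"

    required = registry[name]
    missing = required - set(args)
    if missing:
        return False, "missing arguments: " + ", ".join(sorted(missing))

    extra = set(args) - required
    if extra:
        return False, "unexpected arguments: " + ", ".join(sorted(extra))

    return True, "ok"
```

Feeding the rejection reason back into the agent's next observation gives the model a chance to correct the call instead of drifting off-track.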

