Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Mobile-Bench helps teams test phone assistants and automation agents across realistic multi-app flows. Its results show that APIs speed up tasks but must be selected carefully, so teams should invest in hybrid API+UI support and robust process checks.
Summary TLDR
Mobile-Bench is a new platform, dataset, and metric for evaluating LLM-driven mobile agents that interact with phones through both UI actions and API calls. The dataset has 832 test queries (SAST 332, SAMT 300, MAMT 200), covers 29 apps and 103 APIs, and focuses on multi-app planning and sequential action checks. The authors introduce CheckPoint, a process-focused metric (package, key phrase, API), and test several LLMs (GPT-3.5, GPT-4, LLaMA sizes). Key findings: APIs can speed up execution but hurt coverage if misused; planning and an observation->thought->action loop are essential; GPT-4 achieves the highest PassRate on simple tasks (SAST 80.96%) but drops on multi-app tasks (MAMT 26.5%). Code and the
Problem Statement
Existing benchmarks focus either on UI steps or on single apps, are slow to evaluate real mobile flows, and lack process-aware metrics for sequential multi-app tasks. This makes it hard to test whether LLM agents can plan across apps, choose between APIs and UI actions, and follow multi-step mobile workflows.
Main Contribution
A running mobile test platform that supports hybrid UI operations and API calls.
An 832-entry dataset (SAST, SAMT, MAMT) emphasizing multi-app planning and real user queries plus GPT-4–augmented cases.
A process-oriented metric (CheckPoint) that checks package usage, key phrases, and API calls with sequential/conjunctive/disjunctive logic.
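The CheckPoint idea above can be illustrated with a minimal sketch that scores an agent trace against package, key-phrase, and API checks. The `Check` structure, the `mode` labels ("seq", "and", "or"), the substring matching rule, and the example package and API names are all assumptions for illustration; the summary does not describe the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    kind: str            # "package", "key_phrase", or "api" (hypothetical labels)
    values: List[str]    # strings to look for in the trace
    mode: str = "seq"    # "seq" = in order, "and" = all present, "or" = any present


def check_passes(check: Check, trace: List[str]) -> bool:
    """Return True if the trace satisfies one checkpoint."""
    if check.mode == "and":
        return all(any(v in step for step in trace) for v in check.values)
    if check.mode == "or":
        return any(any(v in step for step in trace) for v in check.values)
    # Sequential: each value must match at a later step than the previous one.
    pos = 0
    for v in check.values:
        while pos < len(trace) and v not in trace[pos]:
            pos += 1
        if pos == len(trace):
            return False
        pos += 1
    return True


def checkpoint_coverage(checks: List[Check], trace: List[str]) -> float:
    """Fraction of checkpoints the trace satisfies (0.0 when there are none)."""
    if not checks:
        return 0.0
    return sum(check_passes(c, trace) for c in checks) / len(checks)


# Hypothetical multi-app trace: search in a maps app, then send an email.
trace = [
    "launch com.example.maps",
    "input coffee shop",
    "launch com.example.mail",
    "call send_email",
]
checks = [
    Check("package", ["com.example.maps", "com.example.mail"], "seq"),
    Check("key_phrase", ["coffee"], "and"),
    Check("api", ["send_sms", "send_email"], "or"),
]
coverage = checkpoint_coverage(checks, trace)
```

Scoring the process rather than the final screen means an agent that opens the right apps in the wrong order is penalized, which is the behaviour a sequential check is meant to capture.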
Key Findings
Mobile-Bench dataset: 832 cases across three difficulty tiers.
Platform app/API coverage: 29 apps and 103 usable APIs collected.
CheckPoint composition is dominated by semantic key phrases.
APIs speed up tasks, and removing them reduces both success and coverage.
Planning and explicit thought drastically improve success.
Model performance falls sharply with multi-app complexity.
Agents can misjudge completion and prematurely stop.
Results
GPT-4 PassRate (by difficulty)
Average #Steps (GPT-4)
API ablation (GPT-4) effect on PassRate
Thought ablation (GPT-4) effect on PassRate
CheckPoint coverage composition
Who Should Care
What To Try In 7 Days
Run Mobile-Bench on a small set of production tasks to see where APIs save steps and where they fail.
Add a thought/plan step in your agent loop and measure PassRate before/after on simple tasks.
Inspect agent completion logic and add explicit checkpoint checks to avoid premature stopping.
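For the before/after measurement and the completion check suggested above, a minimal sketch might look like the following. It assumes PassRate is the percentage of tasks judged complete and that required checkpoints can be matched as substrings of trace steps; `pass_rate` and `may_stop` are hypothetical helper names, not part of Mobile-Bench.

```python
from typing import List, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks judged complete (assumed definition of PassRate)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


def may_stop(agent_says_done: bool, required: List[str], trace: List[str]) -> bool:
    """Accept the agent's 'done' signal only if every required checkpoint
    appears somewhere in the trace, guarding against premature stopping."""
    if not agent_says_done:
        return False
    return all(any(item in step for step in trace) for item in required)


# Hypothetical outcomes for the same four tasks without and with a thought step.
without_thought = [True, False, False, True]
with_thought = [True, False, True, True]
delta = pass_rate(with_thought) - pass_rate(without_thought)
```

The guard is deliberately conservative: it can only veto a stop, never force one, so it does not change behaviour on tasks the agent already completes correctly.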
Agent Features
Memory
- short-term action history (used to judge progress)
Planning
- observation->thought->plan->action loop
- iterative plan execution with action history
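The loop listed above can be sketched as a small driver that alternates observation, thought, and action while recording an action history. The callables `observe`, `think`, and `act`, the `"FINISH"` sentinel, and the step budget are assumptions made for this illustration; in a real agent `think` would wrap an LLM call.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (thought, action) pairs


def run_episode(
    observe: Callable[[], str],
    think: Callable[[str, History], Tuple[str, str]],
    act: Callable[[str], None],
    max_steps: int = 10,
) -> History:
    """Run an observation -> thought -> action loop until FINISH or the budget ends."""
    history: History = []
    for _ in range(max_steps):
        observation = observe()
        thought, action = think(observation, history)
        history.append((thought, action))
        if action == "FINISH":
            break
        act(action)
    return history


# Toy environment: three screens reached by clicking through a menu.
state = {"screen": "home"}
next_screen = {"home": "settings", "settings": "bluetooth"}


def observe() -> str:
    return state["screen"]


def think(observation: str, history: History) -> Tuple[str, str]:
    if observation == "bluetooth":
        return ("target screen reached", "FINISH")
    return ("need to go deeper", "click:" + next_screen[observation])


def act(action: str) -> None:
    state["screen"] = action.split(":", 1)[1]


history = run_episode(observe, think, act)
actions = [action for _, action in history]
```

Passing the history into `think` is what lets the planner judge progress from earlier steps instead of reacting only to the current screen.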
Tool Use
- UI actions (click/input/scroll)
- API calls via ADB
- hybrid API+UI decision-making
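As a rough illustration of the hybrid tool use listed above, the sketch below builds `adb` command lines for UI actions and for an intent-style API call, then prefers the API when one is registered for a step. The mapping from a Mobile-Bench API to an `am start` intent, the `api_registry` structure, and the step dictionaries are assumptions; the summary only states that APIs are called via ADB. The commands are built as argument lists and not executed here.

```python
from typing import Dict, List


def ui_command(action: Dict) -> List[str]:
    """Translate a UI action into an `adb shell input` command line."""
    kind = action["type"]
    if kind == "click":
        return ["adb", "shell", "input", "tap", str(action["x"]), str(action["y"])]
    if kind == "input":
        # `input text` expects spaces to be encoded as %s.
        return ["adb", "shell", "input", "text", action["text"].replace(" ", "%s")]
    if kind == "scroll":
        coords = [str(action[k]) for k in ("x1", "y1", "x2", "y2")]
        return ["adb", "shell", "input", "swipe", *coords]
    raise ValueError(f"unknown UI action type: {kind}")


def api_command(intent_action: str, uri: str) -> List[str]:
    """Build an intent launch, one plausible way to expose an app API over ADB."""
    return ["adb", "shell", "am", "start", "-a", intent_action, "-d", uri]


def choose_command(step: Dict, api_registry: Dict[str, Dict[str, str]]) -> List[str]:
    """Prefer a registered API for the step; otherwise fall back to the UI action."""
    api = api_registry.get(step.get("api", ""))
    if api is not None:
        return api_command(api["action"], api["uri"])
    return ui_command(step["ui"])


# Hypothetical registry with a single dial API.
api_registry = {
    "dial": {"action": "android.intent.action.DIAL", "uri": "tel:5550100"},
}
via_api = choose_command({"api": "dial", "ui": {"type": "click", "x": 10, "y": 20}}, api_registry)
via_ui = choose_command({"api": "unknown", "ui": {"type": "click", "x": 10, "y": 20}}, api_registry)
```

To run a command on a connected device or emulator, the list could be passed to `subprocess.run(cmd, check=True)`. Keeping the UI fallback in every step is one way to limit the damage when the agent picks an API that does not exist.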
Frameworks
- Appium
- Android emulator
- ADB
Is Agentic
true
Collaboration
- multi-app coordination and app selection
Optimization Features
Token Efficiency
- action-history compression for long MAMT cases
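The summary does not say how the action history is compressed. One simple possibility, shown here purely as an assumption, is to keep the most recent steps verbatim and collapse older ones into a single line naming the apps visited. The `app: action` step format and the `compress_history` helper are hypothetical.

```python
from typing import List


def compress_history(history: List[str], keep_last: int = 3) -> List[str]:
    """Keep the last `keep_last` steps verbatim and summarize the rest in one line."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(history) <= keep_last:
        return list(history)
    older = history[:-keep_last]
    apps: List[str] = []
    for step in older:
        app = step.split(":", 1)[0]  # assumed "app: action" step format
        if app not in apps:
            apps.append(app)
    summary = f"[{len(older)} earlier steps in: {', '.join(apps)}]"
    return [summary] + history[-keep_last:]


history = [
    "maps: search coffee",
    "maps: click first result",
    "maps: copy address",
    "mail: open compose",
    "mail: paste address",
    "mail: click send",
]
compressed = compress_history(history, keep_last=2)
```

A summary line like this trades detail for context space, so it suits long multi-app cases where the agent mainly needs to know which apps it has already finished with.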
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLMs hallucinate or mispredict API calls, confusing app functionality.
- CheckPoint evaluates process coverage but cannot fully assess final outcome quality.
- Benchmark needs broad API/SDK support; many third-party APIs remain missing.
When Not To Use
- When you need an end-to-end quality score of final outputs rather than process checks.
- When target apps lack accessible APIs or require payment flows (these were filtered).
- For privacy-sensitive tasks using real user data not covered by the released dataset.
Failure Modes
- Hallucinated API calls that lead the agent off-track.
- Premature termination: agent declares success before task completion.
- Greedy exploration: agent stays too long in one app and fails to switch.
- Long action histories exceed context and reduce judgment accuracy.
Core Entities
Models
- GPT-3.5-turbo
- GPT-4
- LLaMA-13B
- LLaMA-70B
Metrics
- PassRate
- CheckPoint l1
- CheckPoint l2
- Average #Steps
Datasets
- Mobile-Bench (SAST, SAMT, MAMT)
Benchmarks
- Mobile-Bench
Context Entities
Datasets
- RICO (context reference)

