Mobile-Bench: a platform and dataset to test mobile LLM agents that use both UI actions and APIs with a CheckPoint process metric

Overview

Decision Snapshot: Needs Validation

The platform and dataset are a useful prototype for hybrid API+UI evaluation. Experiments across multiple LLMs back the paper's claims, but practical deployment needs richer APIs, robustness fixes, and fine-tuning.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, Shuo Shang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Mobile-Bench helps teams test phone assistants and automation agents across realistic multi-app flows and shows that APIs speed tasks but require careful selection; invest in hybrid API+UI support and robust process checks.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, CTO

Summary TLDR

Mobile-Bench is a new platform, dataset, and metric for evaluating LLM-driven mobile agents that interact with phones through both UI actions and API calls. The dataset has 832 test queries (SAST 332, SAMT 300, MAMT 200), covers 29 apps and 103 APIs, and focuses on multi-app planning and sequential action checks. The authors introduce CheckPoint, a process-focused metric (package, key phrase, API), and test several LLMs (GPT-3.5, GPT-4, and LLaMA models of different sizes). Key findings: APIs can speed up execution but hurt coverage if misused; planning and an observation->thought->action loop are essential; GPT-4 achieves the highest PassRate on simple tasks (SAST 80.96%) but drops sharply on multi-app tasks (MAMT 26.5%). Code and the

Problem Statement

Existing benchmarks either focus on UI steps or single apps, are slow to evaluate real mobile flows, and lack process-aware metrics for sequential multi-app tasks. This makes it hard to test whether LLM agents can plan across apps, choose APIs versus UI, and follow multi-step mobile workflows.

Main Contribution

A running mobile test platform that supports hybrid UI operations and API calls.

An 832-entry dataset (SAST, SAMT, MAMT) emphasizing multi-app planning, built from real user queries plus GPT-4–augmented cases.

Key Findings

Mobile-Bench dataset: 832 cases across three difficulty tiers.

Numbers: SAST 332, SAMT 300, MAMT 200

Practical Use: Use this dataset to test agents on single-app, multi-task single-app, and multi-app scenarios representative of real voice queries.

Evidence Ref: Section 3.1; Figure 3(a)

Platform app/API coverage: 29 apps and 103 usable APIs collected.

Numbers: 29 apps; 103 APIs

Practical Use: Benchmarks can test API+UI tradeoffs; expect to need similar app/API breadth to reproduce results.

Evidence Ref: Section 3.1; Dataset statistics

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 PassRate (by difficulty)	SAST 80.96%, SAMT 63%, MAMT 26.5%	—	—	Mobile-Bench - Table 4	Table 4 reports per-tier PassRate for GPT-4	Table 4
Average #Steps (GPT-4)	SAST 3.79, SAMT 13.94, MAMT 44.86	—	—	Mobile-Bench - Table 4	Table 4 Average #Steps for GPT-4	Table 4
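
The Results table lists no explicit deltas for these rows, so the gaps between tiers have to be derived by hand. The sketch below does that arithmetic on the figures quoted above. The derived numbers are plain subtraction and division on Table 4 values, not results reported by the authors, and the names `TIERS`, `pass_rate_drop`, and `step_ratio` are illustrative.

```python
# Figures copied from the Results table (GPT-4, Table 4) and the dataset split.
TIERS = {
    "SAST": {"cases": 332, "pass_rate": 80.96, "avg_steps": 3.79},
    "SAMT": {"cases": 300, "pass_rate": 63.0, "avg_steps": 13.94},
    "MAMT": {"cases": 200, "pass_rate": 26.5, "avg_steps": 44.86},
}


def pass_rate_drop(baseline: str, target: str) -> float:
    """Percentage-point drop in PassRate going from `baseline` to `target`."""
    return round(TIERS[baseline]["pass_rate"] - TIERS[target]["pass_rate"], 2)


def step_ratio(baseline: str, target: str) -> float:
    """How many times more steps `target` needs on average than `baseline`."""
    return round(TIERS[target]["avg_steps"] / TIERS[baseline]["avg_steps"], 2)


# Sanity check: the three tiers should add up to the stated dataset size.
total_cases = sum(tier["cases"] for tier in TIERS.values())

sast_to_mamt_drop = pass_rate_drop("SAST", "MAMT")   # 54.46 percentage points
sast_to_mamt_steps = step_ratio("SAST", "MAMT")      # roughly 11.84x more steps
```

Read this way, the multi-app tier costs GPT-4 about 54 percentage points of PassRate while requiring roughly twelve times as many steps as the simplest tier.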

What To Try In 7 Days

Run Mobile-Bench on a small set of production tasks to see where APIs save steps and where they fail.

Add a thought/plan step in your agent loop and measure PassRate before/after on simple tasks.

Inspect agent completion logic and add explicit checkpoint checks to avoid premature stopping.
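
The checkpoint idea in the last suggestion can be sketched as a guard on the agent's "finish" decision. This is a simplified illustration, not the paper's CheckPoint formula: it assumes a task comes with expected packages, key phrases, and API names (the three checkpoint types named in the summary), and it scores a trajectory by the fraction of those that were observed. The `Checkpoints` and `Step` schemas and the `may_finish` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoints:
    """Expected process checkpoints for one task (hypothetical schema)."""
    packages: set[str] = field(default_factory=set)
    key_phrases: set[str] = field(default_factory=set)
    apis: set[str] = field(default_factory=set)


@dataclass
class Step:
    """One recorded agent step (hypothetical schema)."""
    package: str = ""  # app package in the foreground during this step
    text: str = ""     # text the agent typed or acted on
    api: str = ""      # API name called in this step, if any


def coverage(expected: Checkpoints, trajectory: list[Step]) -> float:
    """Fraction of expected checkpoints that appear anywhere in the trajectory."""
    seen_packages = {step.package for step in trajectory}
    seen_apis = {step.api for step in trajectory}
    joined_text = " ".join(step.text.lower() for step in trajectory)

    hits = len(expected.packages & seen_packages)
    hits += len(expected.apis & seen_apis)
    hits += sum(1 for phrase in expected.key_phrases if phrase.lower() in joined_text)

    total = len(expected.packages) + len(expected.apis) + len(expected.key_phrases)
    return hits / total if total else 1.0


def may_finish(expected: Checkpoints, trajectory: list[Step], threshold: float = 1.0) -> bool:
    """Block a 'task complete' declaration until enough checkpoints are covered."""
    return coverage(expected, trajectory) >= threshold
```

Calling `may_finish` before accepting the agent's own completion signal is one way to catch premature termination; lowering `threshold` trades strictness for tolerance of alternative valid paths.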

Agent Features

Memory

short-term action history (used to judge progress)

Planning

observation->thought->plan->action loop

iterative plan execution with action history
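
A minimal sketch of such a loop is shown below, assuming the model is wrapped in a `think` callable that returns a thought, a plan, and the next action. The function names, the `"FINISH"` sentinel, and the dictionary keys are assumptions made for illustration; the summary does not specify the platform's actual interfaces.

```python
from typing import Callable


def run_agent(
    task: str,
    observe: Callable[[], str],
    think: Callable[[str, str, list[str]], dict],
    act: Callable[[str], None],
    max_steps: int = 10,
) -> list[str]:
    """Run an observation -> thought -> plan -> action loop with action history.

    `think` is expected to return a dict with the keys "thought", "plan",
    and "action" (a hypothetical contract for this sketch).
    """
    history: list[str] = []
    for _ in range(max_steps):
        observation = observe()                       # read the current screen state
        decision = think(task, observation, history)  # model reasons and plans
        action = decision["action"]
        if action == "FINISH":                        # model believes the task is done
            break
        act(action)                                   # execute a UI action or API call
        history.append(action)                        # history informs the next step
    return history
```

Passing the history back into `think` on every iteration is what lets the model judge its own progress; `max_steps` bounds the loop so a confused agent cannot run indefinitely.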

Tool Use

UI actions (click/input/scroll)

API calls via ADB

hybrid API+UI decision-making
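
One way to make the hybrid decision safer is to validate every proposed API call against a registry of known APIs and fall back to UI actions when the call does not check out. The sketch below illustrates that pattern only. The registry contents, the action dictionary shape, and the `choose_action` helper are hypothetical and are not taken from the Mobile-Bench code.

```python
# Hypothetical registry: API name -> set of required argument names.
KNOWN_APIS: dict[str, set[str]] = {
    "set_alarm": {"time"},
    "send_message": {"recipient", "body"},
}


def choose_action(proposed: dict, known_apis: dict[str, set[str]] = KNOWN_APIS) -> dict:
    """Accept a proposed API call only if it is registered and fully specified.

    Any other API proposal is replaced by a request to continue through the UI,
    which guards against hallucinated or incomplete API calls.
    """
    if proposed.get("type") != "api":
        return proposed  # UI actions pass through unchanged

    required = known_apis.get(proposed.get("name"))
    supplied = set(proposed.get("args", {}))
    if required is not None and required <= supplied:
        return proposed

    return {"type": "ui", "reason": "unknown or incomplete API call"}
```

For example, a proposal of `{"type": "api", "name": "order_pizza", "args": {}}` would be rejected because `order_pizza` is not in the registry, and the agent would be told to proceed through UI actions instead.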

Frameworks

Appium

Android emulator

ADB

Is Agentic

Yes

Collaboration

multi-app coordination and app selection

Optimization Features

Token Efficiency

action-history compression for long MAMT cases
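
The summary does not say how the history is compressed, so the following is only one simple scheme offered for illustration: keep the most recent actions verbatim and collapse everything older into a single count-based summary line. The action string format (`name(args)`) and the `compress_history` helper are assumptions.

```python
from collections import Counter


def compress_history(history: list[str], keep_last: int = 5) -> list[str]:
    """Keep the last `keep_last` actions verbatim; summarize older ones by type.

    Actions are assumed to look like "click(3)" or "input(hello)"; the part
    before the first "(" is treated as the action type.
    """
    if len(history) <= keep_last:
        return list(history)

    split = len(history) - keep_last
    older, recent = history[:split], history[split:]

    counts = Counter(action.split("(", 1)[0] for action in older)
    summary = ", ".join(f"{name} x{count}" for name, count in sorted(counts.items()))
    return [f"[{len(older)} earlier actions: {summary}]"] + recent
```

With long multi-app trajectories this keeps the prompt length roughly constant, at the cost of losing the exact arguments of early actions.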

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/XiaoMi/MobileBench

Data URLs

https://github.com/XiaoMi/MobileBench

Risks & Boundaries

Limitations

LLMs hallucinate or mispredict API calls, confusing app functionality.

CheckPoint evaluates process coverage but cannot fully assess final outcome quality.

When Not To Use

When you need an end-to-end quality score of final outputs rather than process checks.

When target apps lack accessible APIs or require payment flows (such apps were filtered out of the benchmark).

Failure Modes

Hallucinated API calls that lead the agent off-track.

Premature termination: agent declares success before task completion.

Core Entities

Models

GPT-3.5-turbo

GPT-4

LLaMA-13B

LLaMA-70B

Metrics

PassRate

CheckPoint l1

CheckPoint l2

Average #Steps

Datasets

Mobile-Bench (SAST, SAMT, MAMT)

Benchmarks

Mobile-Bench

Context Entities

Datasets

RICO (context reference)

