Mobile-Agent: operate mobile apps from screenshots using visual perception

January 29, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The system is a working prototype evaluated on a 10-app benchmark with concrete metrics and cases; promising but tied to GPT-4V and Android tests.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can automate mobile UI flows without OS hooks or XML access. That lowers integration cost for cross-device automation, testing, and accessibility tools and works on devices where system metadata is unavailable.

Who Should Care

Summary TLDR

This paper presents Mobile-Agent, a vision-first autonomous agent that operates Android apps using only screenshots. It combines GPT-4V for planning with dedicated OCR and icon detectors (Grounding DINO + CLIP) to map textual and icon instructions to screen coordinates. Mobile-Agent plans step by step, uses a ReAct-like prompt format, and self-reflects to correct wrong operations. The authors release Mobile-Eval, a 10-app benchmark with three difficulty levels, and report ~91% success on simple tasks and ~82% on medium-difficulty tasks, with average per-step correctness (Process Score) between roughly 0.77 and 0.89. Code and models are open-sourced.
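The icon-grounding step in the summary (a detector proposes icon boxes, then CLIP matches each one against the described icon) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the candidate boxes are assumed to come from Grounding DINO, the embedding vectors are assumed to come from CLIP, and the names `IconCandidate`, `cosine`, and `pick_icon` are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class IconCandidate:
    box: tuple        # (left, top, right, bottom) in pixels, assumed from the detector
    embedding: list   # image embedding of the cropped icon, assumed precomputed


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_icon(candidates, text_embedding):
    """Return the tap point (x, y) at the centre of the best-matching box, or None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: cosine(c.embedding, text_embedding))
    left, top, right, bottom = best.box
    return ((left + right) // 2, (top + bottom) // 2)


# Toy data: two detected icons and a description embedding close to the second one.
candidates = [
    IconCandidate(box=(0, 0, 100, 100), embedding=[1.0, 0.0]),
    IconCandidate(box=(200, 400, 300, 500), embedding=[0.0, 1.0]),
]
print(pick_icon(candidates, text_embedding=[0.1, 0.9]))
```

Returning a single centre point keeps the planner's action space simple: it only needs to name an icon, and the localization tools turn that name into coordinates.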

Problem Statement

Existing multimodal LLMs struggle to reliably localize tap targets on mobile screens. Prior fixes require access to app XML/HTML or system metadata, which is not always available. The need is a vision-only agent that can localize and operate apps from screenshots without system-level access.

Main Contribution

Mobile-Agent: a vision-centric mobile agent that operates apps from screenshots using OCR, icon detection, and GPT-4V planning with self-reflection.

Mobile-Eval: a benchmark of 10 common mobile apps with three difficulty levels and multi-app tasks to evaluate mobile agents.

Key Findings

High task success on simple app instructions

Numbers: Success (Instruction 1) = 0.91

Practical Use: For straightforward automations, a vision-only agent can complete ~91% of tasks without system hooks.

Evidence Ref: Table 2; Section 3.2

Robust per-step correctness across tasks

Numbers: Process Score (avg) ≈ 0.89 / 0.77 / 0.84

Practical Use: Most individual actions are correct; you can rely on the agent for multi-step flows but expect occasional wrong clicks.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Success (Instruction type 1 average) | 0.91 | - | - | Mobile-Eval (simple instructions) | Table 2 reports Su avg = 0.91 for Instruction 1 | Table 2
Success (Instruction type 2 average) | 0.82 | - | - | Mobile-Eval (medium difficulty) | Table 2 reports Su avg = 0.82 for Instruction 2 | Table 2

What To Try In 7 Days

Run the open-source Mobile-Agent repo on one Android device and reproduce 2 simple Mobile-Eval tasks.

Swap in your app screenshots and test OCR + icon detection stability on key screens.

Measure step counts vs human flows to estimate efficiency gains for a target task.
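To make the step-count comparison concrete, the sketch below computes the four Mobile-Eval metrics from one logged run. The definitions are assumptions inferred from the metric names in this summary (PS as the fraction of correct agent steps, RE as agent steps relative to human steps, CR as the share of the human's steps the agent reproduced); check them against the paper before reporting numbers. `RunLog` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    completed: bool        # did the agent finish the instruction?
    agent_steps: int       # total operations the agent performed
    correct_steps: int     # operations a reviewer judged correct
    human_steps: int       # operations a human needs for the same task
    human_steps_done: int  # how many of the human's steps the agent reproduced


def score(run):
    """Return Su, PS, RE, CR as a dict (definitions are assumed, see the lead-in)."""
    return {
        "Su": 1.0 if run.completed else 0.0,
        "PS": run.correct_steps / run.agent_steps if run.agent_steps else 0.0,
        "RE": run.agent_steps / run.human_steps if run.human_steps else 0.0,
        "CR": run.human_steps_done / run.human_steps if run.human_steps else 0.0,
    }


# Example: the agent took one extra (wrong) step compared with a human.
run = RunLog(completed=True, agent_steps=5, correct_steps=4,
             human_steps=4, human_steps_done=4)
print(score(run))
```

Under these assumed definitions, an RE above 1.0 means the agent needed more steps than the human reference flow.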

Agent Features

Memory
Short-term operation history (used to track progress and mistakes)
Planning
Iterative self-planning (step-by-step), ReAct-style prompt loop (Observation/Thought/Action)
Tool Use
OCR (text localization), Grounding DINO (icon detection), CLIP (icon-description matching)
Frameworks
ReAct-style prompting
Is Agentic

Yes

Architectures
GPT-4V (MLLM)
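The planning and memory features listed above can be sketched as a small loop. This illustrates only the Observation/Thought/Action format with a short-term history: the `plan` callable stands in for the GPT-4V call, and the action strings, the `parse_reply` helper, and the "screen unchanged means the action had no effect" reflection rule are assumptions, not the repository's actual interface.

```python
import re


def parse_reply(reply):
    """Split a model reply into its Observation / Thought / Action fields."""
    fields = {}
    for key in ("Observation", "Thought", "Action"):
        match = re.search(rf"{key}:\s*(.*)", reply)
        fields[key] = match.group(1).strip() if match else ""
    return fields


def run_agent(instruction, plan, execute, max_steps=10):
    """Drive the plan-then-act loop, keeping a short-term operation history."""
    history = []
    screen = execute(None)  # None means "just capture the current screen"
    for _ in range(max_steps):
        step = parse_reply(plan(instruction, screen, history))
        if step["Action"] == "stop":
            break
        new_screen = execute(step["Action"])
        # Assumed reflection rule: an unchanged screen means the action had no effect.
        step["effective"] = new_screen != screen
        history.append(step)
        screen = new_screen
    return history


# Stub planner and executor so the loop runs without a model or a device.
def fake_plan(instruction, screen, history):
    if screen == "home":
        return "Observation: home screen\nThought: open the app\nAction: tap(Notes)"
    return "Observation: app open\nThought: done\nAction: stop"


def fake_execute(action):
    return "home" if action is None else "notes"


history = run_agent("open Notes", fake_plan, fake_execute)
print(len(history), history[0]["Action"], history[0]["effective"])
```

Passing the history back into `plan` on every step is what lets the planner notice an ineffective action and try a different one.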

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Relies on GPT-4V for planning; its localization weaknesses still require external detectors.

OCR or icon detection failures cause wrong taps or require manual disambiguation.

When Not To Use

Apps that block screenshots or hide UI elements for security.

Highly dynamic UIs where icons/text move rapidly in real time.

Failure Modes

OCR misses or misreads text, causing no match or wrong clicks.

Multiple identical text instances lead to ambiguity and extra human-like selection steps.
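The two OCR failure modes above correspond to the zero-match and multi-match branches of text localization. The sketch below makes those branches explicit. The `(text, box)` shape of the OCR output and the `locate_text` helper are assumptions for illustration; how the real agent resolves the ambiguous case is not specified in this summary.

```python
def locate_text(ocr_results, target):
    """Classify a text lookup as 'none', 'unique', or 'ambiguous'.

    ocr_results: list of (text, (left, top, right, bottom)) pairs.
    Returns (status, tap_points), where tap_points are the box centres.
    """
    wanted = target.strip().lower()
    hits = [box for text, box in ocr_results if text.strip().lower() == wanted]
    points = [((l + r) // 2, (t + b) // 2) for l, t, r, b in hits]
    if not points:
        return "none", []        # OCR missed or misread the text
    if len(points) == 1:
        return "unique", points  # safe to tap directly
    return "ambiguous", points   # needs an extra selection step


screen = [
    ("Settings", (0, 0, 200, 50)),
    ("Send", (10, 900, 110, 960)),
    ("Send", (300, 900, 400, 960)),
]
print(locate_text(screen, "settings"))
print(locate_text(screen, "Send"))
print(locate_text(screen, "Delete"))
```

Returning a status alongside the points lets the caller decide whether to tap, ask the planner to choose among candidates, or fall back to another action.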

Core Entities

Models

GPT-4V, Grounding DINO, CLIP

Metrics

Success (Su), Process Score (PS), Relative Efficiency (RE), Completion Rate (CR)

Datasets

Mobile-Eval (introduced by paper)

Benchmarks

Mobile-Eval

Context Entities

Models

GPT-4V (planning), CLIP (icon matching)

Metrics

Su, PS, RE, CR

Datasets

Mobile-Eval (app tasks)

Benchmarks

Mobile-Eval