Mobile-Agent: operate mobile apps from screenshots using visual perception

Overview

Decision Snapshot: Needs Validation

The system is a working prototype evaluated on a 10-app benchmark with concrete metrics and cases; promising but tied to GPT-4V and Android tests.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can automate mobile UI flows without OS hooks or XML access. That lowers integration cost for cross-device automation, testing, and accessibility tools and works on devices where system metadata is unavailable.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper presents Mobile-Agent, a vision-first autonomous agent that operates Android apps using only screenshots. It combines GPT-4V for planning with OCR for text localization and icon detection (Grounding DINO + CLIP) to map text and icon targets to screen coordinates. Mobile-Agent plans step by step, uses a ReAct-like prompt format, and self-reflects to correct wrong operations. The authors release Mobile-Eval, a 10-app benchmark with three difficulty levels, and report ~91% success on simple instructions and ~82% on medium-difficulty instructions, with average per-step process scores between 0.77 and 0.89. Code and data are open-sourced.

Problem Statement

Existing multimodal LLMs struggle to reliably localize tap targets on mobile screens. Prior fixes require access to app XML/HTML or system metadata, which is not always available. The need is a vision-only agent that can localize and operate apps from screenshots without system-level access.

Main Contribution

Mobile-Agent: a vision-centric mobile agent that operates apps from screenshots using OCR, icon detection, and GPT-4V planning with self-reflection.

Mobile-Eval: a benchmark of 10 common mobile apps with three difficulty levels and multi-app tasks to evaluate mobile agents.

Key Findings

High task success on simple app instructions

Numbers: Success (Instruction1) = 0.91

Practical Use: For straightforward automations, a vision-only agent can complete ~91% of tasks without system hooks.

Evidence Ref: Table 2; Section 3.2

Robust per-step correctness across tasks

Numbers: Process Score (avg) ≈ 0.89 / 0.77 / 0.84

Practical Use: Most individual actions are correct, so the agent is usable for multi-step flows, but with process scores of 0.77 to 0.89 you should expect roughly 11% to 23% of steps to be wrong.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success (Instruction type 1 average)	0.91	—	—	Mobile-Eval (simple instructions)	Table 2 reports SU avg = 0.91 for Instruction1	Table 2
Success (Instruction type 2 average)	0.82	—	—	Mobile-Eval (medium difficulty)	Table 2 reports SU avg = 0.82 for Instruction2	Table 2

What To Try In 7 Days

Run the open-source Mobile-Agent repo on one Android device and reproduce 2 simple Mobile-Eval tasks.

Swap in your app screenshots and test OCR + icon detection stability on key screens.

Measure step counts vs human flows to estimate efficiency gains for a target task.

Agent Features

Memory

Short-term operation history (used to track progress and mistakes)
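A minimal sketch of how such a short-term history could be kept and rendered into the next prompt. The class names, the `screen_changed` flag, and the window size are all hypothetical choices for illustration; the summary above does not describe how the released code stores its history.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    thought: str
    action: str
    # Hypothetical signal: did the screenshot differ after the action?
    screen_changed: bool


@dataclass
class OperationHistory:
    max_steps: int = 10  # assumed window size, not taken from the paper
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        """Append a step and keep only the most recent `max_steps` entries."""
        self.steps.append(step)
        self.steps = self.steps[-self.max_steps:]

    def suspected_mistakes(self) -> List[Step]:
        """Steps with no visible effect are candidates for self-reflection."""
        return [s for s in self.steps if not s.screen_changed]

    def as_prompt(self) -> str:
        """Render the history as numbered lines for the next planning prompt."""
        lines = []
        for i, s in enumerate(self.steps, start=1):
            flag = "" if s.screen_changed else " [no visible effect]"
            lines.append(f"{i}. {s.action}{flag}")
        return "\n".join(lines)
```

Flagging steps that left the screen unchanged is one simple way to give the planner something to reflect on; other signals (for example, a mismatch between the expected and observed screen) would slot in the same way.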

Planning

Iterative self-planning (step-by-step)

ReAct-style prompt loop (Observation/Thought/Action)
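To make the Observation/Thought/Action format concrete, here is a small sketch of how a model reply in that shape could be split into its three parts before the action is executed. The `Label: text` layout and the `click_text(...)` action string in the usage note are assumptions for illustration, not the prompt format confirmed by the paper.

```python
import re
from typing import Dict

# Assumed layout: each section starts a line with "Label: text".
SECTION = re.compile(r"^(Observation|Thought|Action):\s*(.*)$")
REQUIRED = {"Observation", "Thought", "Action"}


def parse_react_reply(reply: str) -> Dict[str, str]:
    """Split a ReAct-style reply into its Observation/Thought/Action parts."""
    sections: Dict[str, str] = {}
    current = None
    for raw in reply.splitlines():
        line = raw.strip()
        match = SECTION.match(line)
        if match:
            current = match.group(1)
            sections[current] = match.group(2)
        elif current and line:
            # Continuation line: append to the section being read.
            sections[current] += " " + line
    missing = REQUIRED - sections.keys()
    if missing:
        raise ValueError(f"reply is missing sections: {sorted(missing)}")
    return {name: text.strip() for name, text in sections.items()}
```

For example, a reply ending in `Action: click_text(Settings)` would yield `"click_text(Settings)"` under the `"Action"` key, which a dispatcher could then map to a tool call. Rejecting replies with a missing section gives the loop a clear point at which to re-prompt.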

Tool Use

OCR (text localization)

Grounding DINO (icon detection)

CLIP (icon-description matching)
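The sketch below illustrates the localization idea behind these tools: once OCR has produced text boxes, a text target becomes a tap point at the box center, and once candidate icon crops have been scored against a description, the highest-scoring crop wins. `OcrBox`, `find_text`, and `best_icon` are hypothetical names, and the OCR output and similarity scores are assumed to be computed elsewhere (no real OCR, Grounding DINO, or CLIP call is made here).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrBox:
    """One OCR detection: recognized text plus a pixel bounding box."""
    text: str
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> Tuple[int, int]:
        """Tap point: the center of the bounding box."""
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def find_text(boxes: List[OcrBox], target: str) -> List[OcrBox]:
    """Return every box whose text equals the target, ignoring case and padding."""
    wanted = target.strip().casefold()
    return [b for b in boxes if b.text.strip().casefold() == wanted]


def best_icon(scores: List[float]) -> int:
    """Index of the icon crop with the highest similarity to the description.

    `scores` stands in for precomputed image-text similarities, one per
    detected icon crop. Raises ValueError if no icons were detected.
    """
    return max(range(len(scores)), key=scores.__getitem__)
```

Exact matching is deliberately strict here; a real pipeline would probably need fuzzy matching to tolerate OCR misreads, at the cost of more ambiguous hits.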

Frameworks

ReAct-style prompting

Is Agentic

Yes

Architectures

GPT-4V (MLLM)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/X-PLUG/MobileAgent

Data URLs

https://github.com/X-PLUG/MobileAgent

Risks & Boundaries

Limitations

Relies on GPT-4V for planning; its localization weaknesses still require external detectors.

OCR or icon detection failures cause wrong taps or require manual disambiguation.

When Not To Use

Apps that block screenshots or hide UI elements for security.

Highly dynamic UIs where icons/text move rapidly in real time.

Failure Modes

OCR misses or misreads text, causing no match or wrong clicks.

Multiple identical text instances on screen create ambiguity and require an extra selection step to pick the intended one.
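The two failure modes above suggest a three-way branch on the number of OCR matches for a target string. The following sketch shows one way to express that branch; the step names (`reselect`, `tap`, `disambiguate`) and the dictionary shape are hypothetical, and the actual handling in the released code may differ.

```python
from typing import Dict, List, Tuple

# A match is a pixel bounding box: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def next_step_for_matches(matches: List[Box]) -> Dict[str, object]:
    """Decide what to do after searching the screen for a target string."""
    if not matches:
        # OCR missed or misread the text: ask the planner for another target.
        return {"kind": "reselect", "reason": "target text not found on screen"}
    if len(matches) == 1:
        left, top, right, bottom = matches[0]
        # Unambiguous: tap the center of the single matching box.
        return {"kind": "tap", "point": ((left + right) // 2, (top + bottom) // 2)}
    # Several identical strings: hand the candidates back for an explicit choice.
    return {"kind": "disambiguate", "candidates": list(matches)}
```

Keeping the zero-match and many-match cases as explicit outcomes, rather than tapping the first hit, makes both failure modes visible in logs and countable during evaluation.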

Core Entities

Models

GPT-4V, Grounding DINO, CLIP

Metrics

Success (Su), Process Score (PS), Relative Efficiency (RE), Completion Rate (CR)
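As a rough illustration of how these four metrics could be computed from logged runs, the sketch below uses the following assumed definitions, which this summary does not spell out and which should be checked against the paper: Su is the fraction of instructions completed, PS is the fraction of agent steps that were correct, RE is agent steps divided by human steps for the same tasks, and CR is the fraction of the human reference steps that the agent completed. The `Episode` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One logged attempt at one instruction (field names are illustrative)."""
    succeeded: bool
    agent_steps: int
    correct_agent_steps: int
    human_steps: int
    human_steps_completed: int


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate Su, PS, RE, and CR under the assumed definitions above."""
    if not episodes:
        raise ValueError("need at least one episode")
    total_agent = sum(e.agent_steps for e in episodes)
    total_human = sum(e.human_steps for e in episodes)
    return {
        "Su": sum(e.succeeded for e in episodes) / len(episodes),
        "PS": sum(e.correct_agent_steps for e in episodes) / total_agent,
        "RE": total_agent / total_human,
        "CR": sum(e.human_steps_completed for e in episodes) / total_human,
    }
```

Under these definitions an RE above 1.0 means the agent took more steps than the human reference, which is the figure to watch when estimating efficiency for a target task.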

Datasets

Mobile-Eval (introduced by paper)

Benchmarks

Mobile-Eval

Context Entities

Models

GPT-4V (planning), CLIP (icon matching)

Metrics

Su, PS, RE, CR

Datasets

Mobile-Eval (app tasks)

Benchmarks

Mobile-Eval

