Overview
The system is a working prototype evaluated on a 10-app benchmark with concrete metrics and case studies. It is promising, but the results depend on GPT-4V and were obtained only on Android.
Citations: 11
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
You can automate mobile UI flows without OS hooks or XML access. That lowers integration cost for cross-device automation, testing, and accessibility tools and works on devices where system metadata is unavailable.
Who Should Care
Summary TLDR
This paper presents Mobile-Agent, a vision-first autonomous agent that operates Android apps using only screenshots. It combines GPT-4V for planning with dedicated OCR and icon detectors (Grounding DINO + CLIP) that map text and icon targets in an instruction to screen coordinates. Mobile-Agent plans step by step, uses a ReAct-like prompt format, and self-reflects to correct wrong operations. The authors release Mobile-Eval, a 10-app benchmark with three difficulty levels, and report ~91% success on simple instructions and ~82% on medium-difficulty instructions, with per-step correctness near 80%. Code and models are open-sourced.
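The plan, act, and self-reflect cycle described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `Action`, `run_agent`, and the three callbacks are hypothetical names, and the reflection rule used here (treat an action as failed if the screen did not change) is an assumption chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str          # hypothetical action types, e.g. "tap_text" or "stop"
    target: str = ""   # text or icon description to act on


def run_agent(
    instruction: str,
    take_screenshot: Callable[[], str],
    plan: Callable[[str, str, List[Tuple[str, Action]]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Tuple[str, Action]]:
    """Run a plan/act/reflect loop and return the (status, action) history."""
    history: List[Tuple[str, Action]] = []
    for _ in range(max_steps):
        before = take_screenshot()
        action = plan(instruction, before, history)
        if action.kind == "stop":
            break
        execute(action)
        after = take_screenshot()
        # Assumed reflection rule: an unchanged screen means the action failed.
        # The failure is recorded so the planner can choose differently next time.
        status = "ok" if after != before else "failed"
        history.append((status, action))
    return history
```

In a real setup, `plan` would wrap a multimodal model call and `execute` would drive the device; both are left as callbacks here so the loop can be exercised without either.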
Problem Statement
Existing multimodal LLMs struggle to reliably localize tap targets on mobile screens. Prior fixes require access to app XML/HTML or system metadata, which is not always available. The need is a vision-only agent that can localize and operate apps from screenshots without system-level access.
Main Contribution
Mobile-Agent: a vision-centric mobile agent that operates apps from screenshots using OCR, icon detection, and GPT-4V planning with self-reflection.
Mobile-Eval: a benchmark of 10 common mobile apps with three difficulty levels and multi-app tasks to evaluate mobile agents.
Key Findings
High task success on simple app instructions
Robust per-step correctness across tasks
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Success (Instruction type 1 average) | 0.91 | — | — | Mobile-Eval (simple instructions) | Table 2 reports SU avg = 0.91 for Instruction1 | Table 2 |
| Success (Instruction type 2 average) | 0.82 | — | — | Mobile-Eval (medium difficulty) | Table 2 reports SU avg = 0.82 for Instruction2 | Table 2 |
What To Try In 7 Days
Run the open-source Mobile-Agent repo on one Android device and reproduce 2 simple Mobile-Eval tasks.
Swap in your app screenshots and test OCR + icon detection stability on key screens.
Measure step counts vs human flows to estimate efficiency gains for a target task.
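For the step-count comparison, a small helper is enough to keep the numbers consistent across tasks. The sketch below is an assumption about how one might compute the two ratios; `step_metrics` and its field names are hypothetical and are not taken from the Mobile-Agent repository.

```python
def step_metrics(agent_steps: int, correct_steps: int, human_steps: int) -> dict:
    """Summarize one task run against a human reference flow.

    per_step_correctness: share of the agent's steps judged correct.
    steps_vs_human: agent steps divided by human steps (1.0 means parity,
    values above 1.0 mean the agent needed more steps than the human).
    """
    if agent_steps <= 0 or human_steps <= 0:
        raise ValueError("step counts must be positive")
    if not 0 <= correct_steps <= agent_steps:
        raise ValueError("correct_steps must be between 0 and agent_steps")
    return {
        "per_step_correctness": correct_steps / agent_steps,
        "steps_vs_human": agent_steps / human_steps,
    }


# Example: the agent took 5 steps (4 judged correct) where a human needs 4.
print(step_metrics(agent_steps=5, correct_steps=4, human_steps=4))
```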
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Risks & Boundaries
Limitations
Relies on GPT-4V for planning; because GPT-4V localizes tap targets poorly, external OCR and icon detectors are still required.
OCR or icon detection failures cause wrong taps or require manual disambiguation.
When Not To Use
Apps that block screenshots or hide UI elements for security.
Highly dynamic UIs where icons/text move rapidly in real time.
Failure Modes
OCR misses or misreads text, causing no match or wrong clicks.
Multiple identical text instances lead to ambiguity and extra human-like selection steps.
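The two OCR failure modes above (no match, and several identical matches) can be made explicit when resolving a target string to tap coordinates. The following is a minimal sketch under assumptions: `OcrBox` and `locate_text` are hypothetical names, the OCR output is assumed to be a list of text strings with pixel bounding boxes, and matching is a simple case-insensitive exact comparison rather than whatever the released code uses.

```python
from typing import List, NamedTuple, Tuple


class OcrBox(NamedTuple):
    text: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def locate_text(target: str, boxes: List[OcrBox]) -> Tuple[str, List[Tuple[int, int]]]:
    """Return a status and the tap centers of every box matching `target`.

    Status is "no_match", "unique", or "ambiguous"; the caller decides how to
    recover (for example re-planning or asking for a choice among candidates).
    """
    wanted = target.strip().lower()
    hits = [b for b in boxes if b.text.strip().lower() == wanted]
    if not hits:
        return "no_match", []
    centers = [((x1 + x2) // 2, (y1 + y2) // 2) for _, (x1, y1, x2, y2) in hits]
    status = "unique" if len(hits) == 1 else "ambiguous"
    return status, centers


boxes = [
    OcrBox("Send", (0, 0, 100, 40)),
    OcrBox("send", (0, 100, 100, 140)),
    OcrBox("Cancel", (200, 0, 300, 40)),
]
print(locate_text("Cancel", boxes))  # one hit: safe to tap its center
print(locate_text("Send", boxes))    # two hits: needs disambiguation
print(locate_text("Delete", boxes))  # no hit: OCR miss or wrong target
```

Returning a status instead of silently picking the first hit keeps wrong taps visible to the planning step, at the cost of an extra decision when the text appears more than once.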

