Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
11
Why It Matters For Business
Mobile UI flows can be automated without OS hooks or XML access. This lowers integration cost for cross-device automation, testing, and accessibility tools, and the approach still works on devices where system metadata is unavailable.
Summary TLDR
This paper presents Mobile-Agent, a vision-first autonomous agent that operates Android apps using only screenshots. It combines GPT-4V for planning with dedicated OCR and icon detectors (Grounding DINO + CLIP) to map textual and icon instructions to screen coordinates. Mobile-Agent plans step-by-step, uses a ReAct-like prompt format, and self-reflects to correct wrong operations. The authors release Mobile-Eval, a 10-app benchmark with three difficulty levels, and report ~91% success on simple tasks and ~82% on harder tasks, with per-step correctness near 80%. Code and models are open-sourced.
Problem Statement
Existing multimodal LLMs struggle to reliably localize tap targets on mobile screens. Prior fixes require access to app XML/HTML or system metadata, which is not always available. What is needed is a vision-only agent that can localize targets and operate apps from screenshots alone, without system-level access.
Main Contribution
Mobile-Agent: a vision-centric mobile agent that operates apps from screenshots using OCR, icon detection, and GPT-4V planning with self-reflection.
Mobile-Eval: a benchmark of 10 common mobile apps with three difficulty levels and multi-app tasks to evaluate mobile agents.
An evaluation on Mobile-Eval showing strong per-step accuracy and high completion rates, together with open-sourced code and models.
Key Findings
High task success on simple app instructions
Robust per-step correctness across tasks
Operates near human step-efficiency on average
Self-reflection corrects errors
Results
Success (Instruction type 1 average)
Success (Instruction type 2 average)
Success (Instruction type 3 average)
Process Score (per-step correctness)
Relative Efficiency (Mobile-Agent steps / human steps)
Completion Rate (CR average)
Who Should Care
What To Try In 7 Days
Run the open-source Mobile-Agent repo on one Android device and reproduce 2 simple Mobile-Eval tasks.
Swap in your app screenshots and test OCR + icon detection stability on key screens.
Measure step counts vs human flows to estimate efficiency gains for a target task.
Agent Features
Memory
- Short-term operation history (used to track progress and mistakes)
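
A short-term operation history of this kind can be sketched as a bounded log that also flags steps worth reflecting on. This is a minimal illustration, not the authors' implementation: the `Step` and `OperationHistory` names, the screen fingerprints, and the rule "reflect when the screen did not change" are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str          # e.g. "click_text(Settings)" (hypothetical action syntax)
    screen_before: str   # hypothetical fingerprint of the screenshot before the action
    screen_after: str    # hypothetical fingerprint of the screenshot after the action


class OperationHistory:
    """Bounded log of recent operations (assumed design, for illustration)."""

    def __init__(self, max_steps=10):
        # deque with maxlen silently drops the oldest step when full
        self._steps = deque(maxlen=max_steps)

    def record(self, step):
        self._steps.append(step)

    def needs_reflection(self):
        # Assumed heuristic: an action that left the screen unchanged
        # probably had no effect and should be reconsidered.
        if not self._steps:
            return False
        last = self._steps[-1]
        return last.screen_before == last.screen_after

    def as_prompt(self):
        # Render the retained steps as a numbered list for the next prompt.
        return "\n".join(f"{i}. {s.action}" for i, s in enumerate(self._steps, 1))


history = OperationHistory(max_steps=2)
history.record(Step("click_text(Settings)", "a", "b"))
history.record(Step("click_icon(gear)", "b", "b"))
```

Here the second step leaves the fingerprint unchanged, so `history.needs_reflection()` is true and the planner could be asked to pick a different operation.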
Planning
- Iterative self-planning (step-by-step)
- ReAct-style prompt loop (Observation/Thought/Action)
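
To make the Observation/Thought/Action format concrete, the sketch below parses one model response into its three fields. The exact field layout and the action names (`click_text`, `stop`) are assumptions for illustration; the summary only states that the prompt is ReAct-style.

```python
import re


def parse_step(response):
    """Split a ReAct-style response into observation, thought, and action.

    Assumes one 'Observation:', 'Thought:', and 'Action:' line each, and an
    action written as name(arg1, arg2). Both are hypothetical conventions.
    """
    fields = {}
    for key in ("Observation", "Thought", "Action"):
        match = re.search(rf"^{key}:\s*(.+)$", response, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"missing {key} field")
        fields[key.lower()] = match.group(1).strip()

    call = re.fullmatch(r"(\w+)\((.*)\)", fields["action"])
    if call is None:
        raise ValueError("malformed action")
    name, raw_args = call.groups()
    args = [a.strip() for a in raw_args.split(",")] if raw_args.strip() else []
    fields["action"] = (name, args)
    return fields


example = (
    "Observation: The home screen is visible.\n"
    "Thought: I should open the settings page.\n"
    "Action: click_text(Settings)"
)
parsed = parse_step(example)
```

Rejecting malformed responses with an exception, rather than guessing, gives the outer loop a clear signal to re-prompt the model.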
Tool Use
- OCR (text localization)
- Grounding DINO (icon detection)
- CLIP (icon-description matching)
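
The icon path (detect candidate icons, then match them to a description) reduces to a nearest-neighbour choice over embeddings. The sketch below shows only that last step. The vectors are small made-up lists standing in for CLIP text and image embeddings, and the boxes stand in for detector output; no real model or library call is involved.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_icon(text_vec, icon_vecs, boxes):
    """Return (index, tap point, score) of the icon closest to the description.

    text_vec  : embedding of the icon description (placeholder values here)
    icon_vecs : one embedding per detected icon crop (placeholder values here)
    boxes     : (x1, y1, x2, y2) per detected icon, same order as icon_vecs
    """
    scores = [cosine(text_vec, v) for v in icon_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    x1, y1, x2, y2 = boxes[best]
    return best, ((x1 + x2) // 2, (y1 + y2) // 2), scores[best]


# Hypothetical inputs: two detected icons, the second resembles the description.
index, tap_point, score = pick_icon(
    [1.0, 0.0],
    [[0.0, 1.0], [0.9, 0.1]],
    [(0, 0, 10, 10), (100, 200, 140, 240)],
)
```

In practice a minimum score threshold would be a reasonable addition, so that a poor best match is reported as a failure instead of being tapped.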
Frameworks
- ReAct-style prompting
Is Agentic
true
Architectures
- GPT-4V (MLLM)
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Relies on GPT-4V for planning; because GPT-4V localizes targets poorly on its own, external detectors are still required.
- OCR or icon detection failures cause wrong taps or require manual disambiguation.
- Evaluation only on Android; other OS behavior not validated.
- Some instructions produced invalid steps before correction (PS < 1).
When Not To Use
- Apps that block screenshots or hide UI elements for security.
- Highly dynamic UIs where icons/text move rapidly in real time.
- Settings requiring privileged system calls rather than visible UI actions.
Failure Modes
- OCR misses or misreads text, causing no match or wrong clicks.
- Multiple identical text instances make the target ambiguous and require an extra selection step to choose among them.
- Icon matching errors when visual attributes are ambiguous.
- Dependence on third-party LLM availability and latency (GPT-4V).
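
The two OCR-related failure modes above correspond to distinct outcomes of text localization: no match, or more than one match. The sketch below makes the three cases explicit. The `OcrBox` type, the substring matching rule, and the outcome labels are assumptions for illustration, and the boxes are hand-written rather than produced by a real OCR engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrBox:
    text: str
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self):
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def locate_text(boxes, query):
    """Map a text query to tap candidates.

    Returns ("none", [])           when OCR found no matching text,
            ("tap", [point])       when exactly one box matches,
            ("ambiguous", points)  when several boxes match and a further
                                   selection step is needed.
    """
    needle = query.strip().lower()
    matches = [b for b in boxes if needle in b.text.strip().lower()]
    if not matches:
        return ("none", [])
    if len(matches) == 1:
        return ("tap", [matches[0].center()])
    return ("ambiguous", [b.center() for b in matches])


# Hypothetical OCR output for one screenshot.
screen = [
    OcrBox("Settings", 10, 20, 110, 60),
    OcrBox("Search", 0, 100, 200, 140),
    OcrBox("Search", 0, 300, 200, 340),
]
```

With this input, `locate_text(screen, "settings")` yields a single tap point, `locate_text(screen, "search")` is ambiguous, and `locate_text(screen, "wifi")` finds nothing.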
Core Entities
Models
- GPT-4V
- Grounding DINO
- CLIP
Metrics
- Success (Su)
- Process Score (PS)
- Relative Efficiency (RE)
- Completion Rate (CR)
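
The four metrics can be computed from simple per-task step counts, as in the sketch below. PS and RE follow the descriptions given in this summary (per-step correctness; Mobile-Agent steps over human steps). The CR formula, completed human reference steps over total human steps, is an assumption, since the summary does not define it. The function names and the example numbers are illustrative only, not results from the paper.

```python
def success(task_completed):
    """Su: 1.0 if the instruction was fully completed, else 0.0."""
    return 1.0 if task_completed else 0.0


def process_score(correct_steps, total_steps):
    """PS: fraction of the agent's steps that were correct."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return correct_steps / total_steps


def relative_efficiency(agent_steps, human_steps):
    """RE: agent step count divided by human step count for the same task."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return agent_steps / human_steps


def completion_rate(completed_human_steps, human_steps):
    """CR (assumed definition): share of the human reference steps completed."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return completed_human_steps / human_steps


# Made-up counts for one task: the agent took 5 steps (4 correct),
# a human needs 4 steps, and the agent covered all 4 of them.
ps = process_score(correct_steps=4, total_steps=5)
re_ratio = relative_efficiency(agent_steps=5, human_steps=4)
cr = completion_rate(completed_human_steps=4, human_steps=4)
```

Under these definitions an RE of 1.0 means the agent used exactly as many steps as the human, and values above 1.0 mean it used more.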
Datasets
- Mobile-Eval (introduced by paper)
Benchmarks
- Mobile-Eval
Context Entities
Models
- GPT-4V (planning)
- CLIP (icon matching)
Metrics
- Su, PS, RE, CR
Datasets
- Mobile-Eval (app tasks)
Benchmarks
- Mobile-Eval

