Mobile-Agent: operate mobile apps from screenshots using visual perception

January 29, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

11

Authors

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Links

Abstract / PDF

Why It Matters For Business

You can automate mobile UI flows without OS hooks or XML access. This lowers integration cost for cross-device automation, testing, and accessibility tools, and it works on devices where system metadata is unavailable.

Summary TLDR

This paper presents Mobile-Agent, a vision-first autonomous agent that operates Android apps using only screenshots. It combines GPT-4V for planning with dedicated OCR and icon detectors (Grounding DINO + CLIP) to map textual and icon instructions to screen coordinates. Mobile-Agent plans step-by-step, uses a ReAct-like prompt format, and self-reflects to correct wrong operations. The authors release Mobile-Eval, a 10-app benchmark with three difficulty levels, and report ~91% success on simple tasks and ~82% on harder tasks, with per-step correctness (Process Score) between 0.77 and 0.89 depending on instruction type. Code and models are open-sourced.

Problem Statement

Existing multimodal LLMs struggle to reliably localize tap targets on mobile screens. Prior fixes require access to app XML/HTML or system metadata, which is not always available. What is needed is a vision-only agent that can localize targets and operate apps from screenshots without system-level access.

Main Contribution

Mobile-Agent: a vision-centric mobile agent that operates apps from screenshots using OCR, icon detection, and GPT-4V planning with self-reflection.

Mobile-Eval: a benchmark of 10 common mobile apps with three difficulty levels and multi-app tasks to evaluate mobile agents.

An evaluation on Mobile-Eval showing strong per-step accuracy and high completion rates, together with open-sourced code and models.

Key Findings

High task success on simple app instructions

Numbers: Success (Instruction 1) = 0.91

Robust per-step correctness across tasks

Numbers: Process Score (avg) ≈ 0.89 / 0.77 / 0.84

Operates near human step-efficiency on average

Numbers: Relative Efficiency ≈ 4.9/4.2, 7.9/6.3, 7.5/6.2 (Mobile/Human)

Self-reflection corrects errors

Results

Success (Instruction type 1 average)

Value: 0.91

Success (Instruction type 2 average)

Value: 0.82

Success (Instruction type 3 average)

Value: 0.82

Process Score (per-step correctness)

Value: 0.89 / 0.77 / 0.84

Relative Efficiency (Mobile / Human steps)

Value: 4.9/4.2; 7.9/6.3; 7.5/6.2

Baseline: human

Completion Rate (CR average)

Value: 98.2% / 90.9% / 91.3%
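
Success and Process Score can be computed from a simple run log. The sketch below assumes Success is a binary flag per instruction and Process Score is the fraction of executed steps judged correct. The `RunLog` record, its field names, and the two sample runs are hypothetical illustrations; they are not taken from the Mobile-Agent repository or from the paper's data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunLog:
    """One agent run on one instruction (hypothetical record layout)."""
    completed: bool           # did the agent finish the instruction?
    step_correct: list[bool]  # one flag per executed step


def process_score(run: RunLog) -> float:
    """Fraction of executed steps judged correct (assumed definition)."""
    if not run.step_correct:
        return 0.0
    return sum(run.step_correct) / len(run.step_correct)


def summarize(runs: list[RunLog]) -> dict[str, float]:
    """Average Success and Process Score over a set of runs."""
    return {
        "success": mean(1.0 if r.completed else 0.0 for r in runs),
        "process_score": mean(process_score(r) for r in runs),
    }


# Made-up runs for illustration only.
runs = [
    RunLog(completed=True, step_correct=[True, True, True, True]),
    RunLog(completed=False, step_correct=[True, False, True, False]),
]
summary = summarize(runs)
print(summary)
```

Averaging per run, rather than pooling all steps, keeps long tasks from dominating the Process Score; whether the paper averages this way is not stated here.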

Who Should Care

What To Try In 7 Days

Run the open-source Mobile-Agent repo on one Android device and reproduce 2 simple Mobile-Eval tasks.

Swap in your app screenshots and test OCR + icon detection stability on key screens.

Measure step counts vs human flows to estimate efficiency gains for a target task.
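
As a starting point for that comparison, the reported Relative Efficiency pairs can be turned into a step overhead. This is a minimal sketch: the `step_overhead` helper and the dictionary keys are names chosen for illustration, and the numbers are the Mobile/Human step averages listed under Results.

```python
# (agent_steps, human_steps) averages reported for instruction types 1 to 3.
reported = {
    "instruction_1": (4.9, 4.2),
    "instruction_2": (7.9, 6.3),
    "instruction_3": (7.5, 6.2),
}


def step_overhead(agent_steps: float, human_steps: float) -> float:
    """Extra steps the agent takes relative to a human, as a fraction."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return agent_steps / human_steps - 1.0


overheads = {name: step_overhead(a, h) for name, (a, h) in reported.items()}
for name, value in overheads.items():
    print(f"{name}: {value:.0%} more steps than the human baseline")
```

On the reported averages this works out to roughly 17%, 25%, and 21% more steps than the human baseline. Replace the pairs with step counts from your own flows to get a comparable figure for a target task.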

Agent Features

Memory

  • Short-term operation history (used to track progress and mistakes)
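
One way such a history could track mistakes is to flag any action after which the screen did not change. The sketch below is an assumption about how this might be done, not the repository's implementation: `OperationHistory`, `Step`, and the byte-equality check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str
    had_effect: bool  # False when the screenshot was unchanged afterwards


@dataclass
class OperationHistory:
    steps: list[Step] = field(default_factory=list)

    def record(self, action: str, before: bytes, after: bytes) -> Step:
        """Store an action and flag it if the screenshot did not change."""
        step = Step(action=action, had_effect=(before != after))
        self.steps.append(step)
        return step

    def as_prompt_lines(self) -> list[str]:
        """Render the history as text that could be added to the next prompt."""
        lines = []
        for i, step in enumerate(self.steps, start=1):
            note = "" if step.had_effect else " (no visible effect)"
            lines.append(f"{i}. {step.action}{note}")
        return lines


# Placeholder byte strings stand in for real screenshot data.
history = OperationHistory()
history.record("tap text 'Settings'", before=b"screen-A", after=b"screen-B")
history.record("tap icon 'search'", before=b"screen-B", after=b"screen-B")
print("\n".join(history.as_prompt_lines()))
```

Real screenshots are rarely byte-identical because of clocks and animations, so a practical version would likely need a tolerant image comparison instead of strict equality.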

Planning

  • Iterative self-planning (step-by-step)
  • ReAct-style prompt loop (Observation/Thought/Action)
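
The Observation/Thought/Action loop can be sketched as a parser plus a step-by-step driver. Everything here is illustrative: the one-field-per-line reply layout, the `stop` action, and the `scripted_planner` stand-in for the multimodal model are assumptions, not the prompt format used by Mobile-Agent.

```python
import re
from typing import Callable

# Assumed reply layout: three labelled fields, one per line.
FIELD_PATTERN = re.compile(r"^(Observation|Thought|Action):\s*(.*)$", re.MULTILINE)


def parse_reply(reply: str) -> dict[str, str]:
    """Extract the Observation/Thought/Action fields from a model reply."""
    fields = {name.lower(): value.strip() for name, value in FIELD_PATTERN.findall(reply)}
    missing = {"observation", "thought", "action"} - fields.keys()
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")
    return fields


def run_episode(
    instruction: str,
    planner: Callable[[str, list[str]], str],
    max_steps: int = 10,
) -> list[str]:
    """Ask the planner for one action at a time until it emits 'stop'."""
    history: list[str] = []
    for _ in range(max_steps):
        fields = parse_reply(planner(instruction, history))
        if fields["action"].lower() == "stop":
            break
        history.append(fields["action"])
    return history


def scripted_planner(instruction: str, history: list[str]) -> str:
    """Stand-in for the multimodal model: replays a fixed plan."""
    plan = ["open app 'Notes'", "tap text 'New note'", "stop"]
    return (
        f"Observation: screen after {len(history)} actions\n"
        f"Thought: next step toward '{instruction}'\n"
        f"Action: {plan[len(history)]}"
    )


actions = run_episode("create a new note", scripted_planner)
print(actions)
```

Planning one action per iteration lets each decision use the latest screen state, and the `max_steps` cap bounds the run if the planner never emits `stop`.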

Tool Use

  • OCR (text localization)
  • Grounding DINO (icon detection)
  • CLIP (icon-description matching)
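
How these tools turn an instruction into a tap point can be sketched without the models themselves. The example below assumes OCR and icon detection have already produced boxes, and that text and icon embeddings were computed elsewhere (for instance by CLIP). The `Box` type, the substring match, and the toy 2-D vectors are hypothetical simplifications.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    label: str  # OCR text for a text box
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self) -> tuple[int, int]:
        """Tap point at the middle of the box."""
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def locate_text(query: str, ocr_boxes: list[Box]) -> tuple[str, list[tuple[int, int]]]:
    """Return ('none' | 'unique' | 'ambiguous', candidate tap points)."""
    q = query.strip().lower()
    hits = [b.center() for b in ocr_boxes if q in b.label.lower()]
    status = "none" if not hits else "unique" if len(hits) == 1 else "ambiguous"
    return status, hits


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_icon(text_emb: list[float], icon_embs: dict[str, list[float]]) -> str:
    """Choose the detected icon whose embedding best matches the description."""
    return max(icon_embs, key=lambda name: cosine(text_emb, icon_embs[name]))


# Illustrative screen contents, not real OCR output.
boxes = [
    Box("Settings", 40, 100, 200, 140),
    Box("Search", 40, 200, 200, 240),
    Box("Search history", 40, 300, 200, 340),
]
print(locate_text("settings", boxes))
print(locate_text("search", boxes))
print(pick_icon([1.0, 0.0], {"icon_0": [0.9, 0.1], "icon_1": [0.1, 0.9]}))
```

Returning an explicit status separates the three cases the agent has to handle differently: no match, a single tap target, and several candidates that need disambiguation.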

Frameworks

  • ReAct-style prompting

Is Agentic

true

Architectures

  • GPT-4V (MLLM)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Relies on GPT-4V for planning; its localization weaknesses still require external detectors.
  • OCR or icon detection failures cause wrong taps or require manual disambiguation.
  • Evaluation only on Android; other OS behavior not validated.
  • Some instructions produced invalid steps before correction (PS < 1).

When Not To Use

  • Apps that block screenshots or hide UI elements for security.
  • Highly dynamic UIs where icons/text move rapidly in real time.
  • Settings requiring privileged system calls rather than visible UI actions.

Failure Modes

  • OCR misses or misreads text, causing no match or wrong clicks.
  • Multiple identical text instances on screen create ambiguity and require an extra selection step.
  • Icon matching errors when visual attributes are ambiguous.
  • Dependence on third-party LLM availability and latency (GPT-4V).

Core Entities

Models

  • GPT-4V
  • Grounding DINO
  • CLIP

Metrics

  • Success (Su)
  • Process Score (PS)
  • Relative Efficiency (RE)
  • Completion Rate (CR)

Datasets

  • Mobile-Eval (introduced by paper)

Benchmarks

  • Mobile-Eval

Context Entities

Models

  • GPT-4V (planning)
  • CLIP (icon matching)

Metrics

  • Su, PS, RE, CR

Datasets

  • Mobile-Eval (app tasks)

Benchmarks

  • Mobile-Eval