Overview
Splitting roles reduces input length and concentrates responsibilities; dynamic real-device tests show consistent gains, but results depend on closed APIs (GPT-4/GPT-4V), visual tool quality, and a limited 88-instruction benchmark.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.87
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Splitting UI automation into planning, decision, and reflection raises task success for real-device workflows, reducing manual scripting and improving coverage for multi-app automation and testing.
Summary TLDR
Mobile-Agent-v2 replaces a single MLLM agent with three specialized agents (planning, decision, and reflection) plus a short-term memory and a visual perception module. The planning agent condenses long interleaved image-text histories into a compact task-progress summary, the decision agent chooses operations from the current screen, and the reflection agent checks each outcome and triggers retries. On real-device dynamic tests (88 instructions across system, external, and multi-app tasks) it raises success rates substantially over the single-agent Mobile-Agent: multi-app success rate (SR) and completion rate (CR) improved by 37.5% and 44.2%, and advanced-task SR rose from ~20% to ~55% in one setting. Manually injected operation knowledge (“knowledge injection”) can push results near 100% on many basic tasks.
Problem Statement
Single-agent MLLM approaches struggle on mobile UI tasks for two reasons: (1) task progress navigation — tracking which substeps are already done in a long, interleaved image+text history, and (2) focus content navigation — retrieving task-relevant text/images from prior screens. Long sequence length and mixed modalities reduce single-agent effectiveness, causing wrong or ineffective taps.
Main Contribution
A three-role multi-agent architecture (planning, decision, reflection) that splits responsibilities to reduce context length and improve navigation.
A short-term memory unit that stores task-relevant focus content extracted from screens for later use.
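The division of labor above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: `AgentState`, `run_episode`, the callable signatures, and the verdict labels (`"ok"` and anything else) are hypothetical names chosen for this example, and each role would be backed by an MLLM call in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentState:
    """Hypothetical container for what the three roles share."""
    instruction: str
    progress: str = ""                                 # compact text summary written by the planning role
    memory: List[str] = field(default_factory=list)    # short-term focus content taken from screens
    history: List[str] = field(default_factory=list)   # operations the reflection role accepted


def run_episode(
    state: AgentState,
    observe: Callable[[], str],
    plan: Callable[[str, str, List[str]], str],
    decide: Callable[[str, str, List[str], str], Tuple[str, Optional[str]]],
    act: Callable[[str], None],
    reflect: Callable[[str, str, str], str],
    max_steps: int = 10,
) -> bool:
    """Run plan -> decide -> act -> reflect until the decision role says 'stop'."""
    screen = observe()
    for _ in range(max_steps):
        # Planning: summarize progress from text only, so no screenshots are re-read.
        state.progress = plan(state.instruction, state.progress, state.history)
        # Decision: pick an operation from the current screen, progress, and memory.
        action, note = decide(state.instruction, state.progress, state.memory, screen)
        if action == "stop":
            return True
        if note:
            state.memory.append(note)   # keep task-relevant content for later steps
        act(action)
        new_screen = observe()
        # Reflection: compare before/after; only accepted operations enter the history.
        if reflect(screen, new_screen, action) == "ok":
            state.history.append(action)
        screen = new_screen
    return False
```

Passing the roles in as plain callables keeps the loop testable with stubs before any model or device is attached.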
Key Findings
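Multi-agent design raises task completion vs single-agent Mobile-Agent: success rate improves by 27% on average across the 88-instruction evaluation.
Hard (advanced) instructions improved sharply under Mobile-Agent-v2: non-English advanced-task SR rose from 20% to 55%.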
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Success Rate (SR) | average +27% vs Mobile-Agent | Mobile-Agent single-agent | +27% (avg across English & non-English) | dynamic real-device eval (88 instructions) | Paper states Mobile-Agent-v2 achieves an average improvement of 27% in success rate (4.3.1) | Tables 1-2 |
| Advanced-task Success Rate (SR) | 55% (Mobile-Agent-v2) vs 20% (Mobile-Agent) | Mobile-Agent single-agent | +35 percentage points | non-English advanced instructions | Text reports 55% vs 20% for advanced instructions (4.3.1) | Table 1 |
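The Delta column mixes plain percentages with percentage points, so it helps to keep the two calculations separate when comparing rows. The sketch below uses the 55% vs 20% figures from the table; the function names are hypothetical helpers for this example.

```python
def delta_points(new_pct: float, old_pct: float) -> float:
    """Absolute change, in percentage points."""
    return new_pct - old_pct


def relative_change(new_pct: float, old_pct: float) -> float:
    """Change relative to the baseline, as a fraction of the baseline."""
    if old_pct == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    return (new_pct - old_pct) / old_pct


print(delta_points(55.0, 20.0))     # 35.0  -> +35 percentage points
print(relative_change(55.0, 20.0))  # 1.75  -> +175% relative to the baseline
```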
What To Try In 7 Days
Build a simple three-role prototype: one module to summarize progress, one to act on screenshots, one to check outcomes.
Add a short-term memory that stores task-relevant text and icons from screens and is read before each action.
Run 20 representative multi-app tasks on a test phone and compare SR/CR to your current single-agent or scripted baseline.
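For the comparison in the last step, a small scoring helper is enough. This sketch assumes SR is the fraction of tasks fully completed and CR is the share of correct steps pooled over all tasks; `TaskResult` and its fields are hypothetical, so adjust the definitions if your baseline reports them differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record filled in after each run."""
    task_id: str
    completed: bool      # was the whole instruction carried out?
    correct_steps: int   # steps that matched the reference operations
    total_steps: int     # steps in the reference operations


def success_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks that were fully completed."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.completed for r in results) / len(results)


def completion_rate(results: Sequence[TaskResult]) -> float:
    """Correct steps divided by reference steps, pooled over all tasks."""
    total = sum(r.total_steps for r in results)
    if total == 0:
        raise ValueError("no reference steps to score")
    return sum(r.correct_steps for r in results) / total


baseline = [TaskResult("t1", True, 4, 4), TaskResult("t2", False, 1, 4)]
candidate = [TaskResult("t1", True, 4, 4), TaskResult("t2", True, 4, 4)]
print(success_rate(baseline), completion_rate(baseline))    # 0.5 0.625
print(success_rate(candidate), completion_rate(candidate))  # 1.0 1.0
```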
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
System Optimization
Reproducibility
Risks & Boundaries
Limitations
Relies on closed-source LLM APIs (GPT-4/GPT-4V) and external visual models.
Evaluation used a fixed set of 88 instructions; broader app diversity untested.
When Not To Use
When you must run fully on-device without cloud LLMs or network access.
If only trivial single-step UI tasks are needed (overhead not justified).
Failure Modes
Hallucinated or wrong-tap actions leading to unrelated pages.
Ineffective operations that cause no page change and must be retried.
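The second failure mode, an operation that leaves the page unchanged, can be caught cheaply before involving a model. The sketch below swaps in a byte-level screenshot hash as a stand-in for a reflection check; it is an assumption for illustration, not the source's method, and `screen_fingerprint`, `act_with_retry`, and the `capture`/`act` callables are hypothetical. Exact-byte comparison is brittle (clocks and animations change pixels), and it does not detect the first failure mode of landing on a wrong page.

```python
import hashlib
from typing import Callable


def screen_fingerprint(png_bytes: bytes) -> str:
    """Hash raw screenshot bytes so two captures can be compared cheaply."""
    return hashlib.sha256(png_bytes).hexdigest()


def act_with_retry(
    act: Callable[[], None],
    capture: Callable[[], bytes],
    max_retries: int = 2,
) -> bool:
    """Perform an operation and retry while the screen stays byte-identical.

    Returns True as soon as the screen changes, False if every attempt
    left it unchanged.
    """
    before = screen_fingerprint(capture())
    for _ in range(max_retries + 1):
        act()
        if screen_fingerprint(capture()) != before:
            return True
    return False
```

A caller would treat a `False` result as a signal to ask the decision role for a different operation rather than repeating the same tap.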

