Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Splitting UI automation into planning, decision, and reflection raises task success for real-device workflows, reducing manual scripting and improving coverage for multi-app automation and testing.
Summary TLDR
Mobile-Agent-v2 replaces a single MLLM agent with three specialized agents (planning, decision, and reflection), plus a short-term memory and a visual perception module. The planning agent condenses the long interleaved image-text history into compact task progress, the decision agent chooses operations based on the current screen, and the reflection agent checks outcomes and triggers retries. On dynamic real-device tests (88 instructions across system-app, external-app, and multi-app tasks) it raises success rates substantially over the single-agent Mobile-Agent: multi-app SR and CR improved by 37.5% and 44.2%, and advanced-task SR rose from ~20% to ~55% in one setting. Injecting manual operation knowledge can push results to nearly 100% on many basic tasks.
Problem Statement
Single-agent MLLM approaches struggle on mobile UI tasks for two reasons: (1) task progress navigation — tracking which substeps are already done in a long, interleaved image+text history, and (2) focus content navigation — retrieving task-relevant text/images from prior screens. Long sequence length and mixed modalities reduce single-agent effectiveness, causing wrong or ineffective taps.
Main Contribution
A three-role multi-agent architecture (planning, decision, reflection) that splits responsibilities to reduce context length and improve navigation.
A short-term memory unit that stores task-relevant focus content extracted from screens for later use.
A reflection agent that compares before/after screenshots to detect erroneous or ineffective actions and triggers retries.
A practical visual perception module (OCR and icon detection) and a dynamic real-device evaluation across OSes and languages.
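The division of labor above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the `State` fields, the callable signatures, the `"stop"` action, and the `"ok"` / `"ineffective"` / `"wrong"` verdict strings are all hypothetical names chosen for the example, and each agent is passed in as a plain function so the loop runs without any model behind it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class State:
    instruction: str
    progress: str = ""                                   # text-only task progress
    memory: List[str] = field(default_factory=list)      # short-term focus content
    history: List[str] = field(default_factory=list)     # operations accepted so far


def run_episode(
    state: State,
    observe: Callable[[], str],
    plan: Callable[[State], str],
    decide: Callable[[State, str], str],
    execute: Callable[[str], None],
    reflect: Callable[[State, str, str, str], str],
    max_steps: int = 10,
) -> bool:
    """Run plan -> decide -> execute -> reflect until the decision agent stops."""
    for _ in range(max_steps):
        before = observe()
        # Planning: compress the accepted history into short text progress.
        state.progress = plan(state)
        # Decision: pick the next operation from progress, memory, and screen.
        action = decide(state, before)
        if action == "stop":
            return True
        execute(action)
        after = observe()
        # Reflection: compare the screen before and after the operation.
        verdict = reflect(state, before, after, action)
        if verdict == "ok":
            state.history.append(action)
        # Otherwise the operation is not recorded, so the next iteration
        # re-plans from the unchanged history and decides again.
    return False
```

Keeping the planning input text-only is what shortens the context: the decision step sees one screen plus a short progress string rather than every earlier screenshot.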
Key Findings
Multi-agent design raises task completion vs single-agent Mobile-Agent.
Hard (advanced) instructions improved sharply under Mobile-Agent-v2.
Multi-app scenarios benefit the most from multi-agent memory + reflection.
Manual operation knowledge (hints/tutorials) can push performance to near-perfect on many tasks.
Agent role design matters: the planning agent is the most impactful of the three.
Results
Success Rate (SR)
Advanced-task Success Rate (SR)
Multi-app SR / CR
Accuracy
Who Should Care
What To Try In 7 Days
Build a simple three-role prototype: one module to summarize progress, one to act on screenshots, one to check outcomes.
Add a short-term memory that stores task-relevant text and icons from screens, and read it before each action.
Run 20 representative multi-app tasks on a test phone and compare SR/CR to your current single-agent or scripted baseline.
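For the comparison in the last step, a small scoring helper is enough. The sketch below assumes SR is the fraction of instructions fully completed and CR is the share of correct steps among all ground-truth steps. Those definitions are an assumption here, so check them against the metric definitions you adopt. `TaskResult` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    completed: bool      # was the whole instruction finished?
    correct_steps: int   # steps that matched the ground-truth operation
    total_steps: int     # ground-truth steps for this instruction


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of instructions that were fully completed."""
    return sum(r.completed for r in results) / len(results)


def completion_rate(results: List[TaskResult]) -> float:
    """Correct steps divided by ground-truth steps, pooled over all tasks."""
    total = sum(r.total_steps for r in results)
    return sum(r.correct_steps for r in results) / total
```

Scoring the baseline and the prototype on the same task list with the same helper keeps the two numbers comparable.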
Agent Features
Memory
- short-term focus content memory (stores task-relevant screen facts)
Planning
- task progress summarization (text-only planning agent)
Tool Use
- OCR (ConvNextViT-document)
- icon detection (GroundingDINO)
- icon description (Qwen-VL-Int4)
- ADB for execution
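As an illustration of the ADB execution step, the sketch below maps a few agent-level operations onto standard `adb shell input` commands. The action names (`tap`, `swipe`, `type`, `back`, `home`) and the `adb_command` helper are hypothetical. Only the underlying `adb` subcommands are real, and the helper only builds the argument list, so nothing is sent to a device until you pass the result to `subprocess.run`.

```python
from typing import List, Optional


def adb_command(action: str, *args, serial: Optional[str] = None) -> List[str]:
    """Build the adb argument list for one UI operation (does not run it)."""
    base = ["adb"] + (["-s", serial] if serial else [])
    if action == "tap":
        x, y = args
        return base + ["shell", "input", "tap", str(x), str(y)]
    if action == "swipe":
        x1, y1, x2, y2 = args
        return base + ["shell", "input", "swipe",
                       str(x1), str(y1), str(x2), str(y2)]
    if action == "type":
        (text,) = args
        # `input text` treats %s as a space; other shell metacharacters
        # would still need escaping and are not handled in this sketch.
        return base + ["shell", "input", "text", text.replace(" ", "%s")]
    if action == "back":
        return base + ["shell", "input", "keyevent", "KEYCODE_BACK"]
    if action == "home":
        return base + ["shell", "input", "keyevent", "KEYCODE_HOME"]
    raise ValueError(f"unsupported action: {action}")
```

To execute a command on a connected device, something like `subprocess.run(adb_command("tap", 540, 1200), check=True)` would do. Returning the list first makes the mapping easy to unit-test without a phone attached.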
Frameworks
- Mobile-Agent-v2
Is Agentic
true
Architectures
- planning-decision-reflection multi-agent
Collaboration
- text-passing between agents (planning -> decision -> reflection)
Optimization Features
System Optimization
- context reduction via planning agent
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on closed-source LLM APIs (GPT-4/GPT-4V) and external visual models.
- Evaluation used a fixed set of 88 instructions; broader app diversity untested.
- Performance still benefits from manual knowledge injection, indicating brittleness on some steps.
When Not To Use
- When you must run fully on-device without cloud LLMs or network access.
- If only trivial single-step UI tasks are needed (overhead not justified).
- When you lack reliable screen OCR or icon detection for your UI styles.
Failure Modes
- Hallucinated or wrong-tap actions leading to unrelated pages.
- Ineffective operations that cause no page change and must be retried.
- Memory can store incorrect focus content if perception errs.
- High API cost and latency when using large cloud LLMs for each agent call.
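The "ineffective operation" failure mode can be partly caught before spending a model call. The sketch below is a swapped-in heuristic, not the reflection agent described above: it hashes the raw screenshot bytes before and after an operation and retries a bounded number of times when nothing changed. `capture`, `execute`, and `act_with_retry` are hypothetical names. An exact-byte hash also treats a clock tick or animation frame as a change, and it cannot tell a wrong page from a right one, so it only filters the "no change at all" case.

```python
import hashlib
from typing import Callable


def fingerprint(png_bytes: bytes) -> str:
    """Stable digest of a screenshot's raw bytes."""
    return hashlib.sha256(png_bytes).hexdigest()


def act_with_retry(
    capture: Callable[[], bytes],
    execute: Callable[[], None],
    max_attempts: int = 3,
) -> int:
    """Repeat an operation until the screen changes.

    Returns the attempt number that changed the screen, or 0 if the
    screen was still identical after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        before = fingerprint(capture())
        execute()
        if fingerprint(capture()) != before:
            return attempt
    return 0
```

A return value of 0 is a reasonable point to escalate to the full reflection step or to re-plan, rather than repeating the same tap indefinitely.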
Core Entities
Models
- GPT-4
- GPT-4V
- Qwen-VL-Int4
- ConvNextViT-document
- GroundingDINO
Metrics
- SR
- CR
- DA
- RA
Benchmarks
- dynamic real-device evaluation (88 instructions across apps and languages)
Context Entities
Models
- GPT-4
- GPT-4V
- Gemini-1.5-Pro
- Qwen-VL-Max
Metrics
- SR
- CR
- DA
- RA
Benchmarks
- custom dynamic evaluation (system and external apps, multi-app tasks)

