Split planning, decision and reflection agents to boost mobile UI automation and task completion

June 3, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

Splitting roles reduces input length and concentrates responsibilities; dynamic real-device tests show consistent gains, but results depend on closed APIs (GPT-4/GPT-4V), visual tool quality, and a limited 88-instruction benchmark.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Links

Abstract / PDF / Code

Why It Matters For Business

Splitting UI automation into planning, decision, and reflection raises task success for real-device workflows, reducing manual scripting and improving coverage for multi-app automation and testing.

Who Should Care

Summary TLDR

Mobile-Agent-v2 replaces a single MLLM agent with three specialized agents (planning, decision, and reflection), plus a short-term memory and a visual perception module. The planning agent condenses the long interleaved image-text history into a compact text summary of task progress, the decision agent chooses the next operation from the current screen, and the reflection agent checks the outcome and triggers a retry when an operation fails. On real-device dynamic tests (88 instructions across system, external, and multi-app tasks) it raises success rates substantially over the single-agent Mobile-Agent: multi-app SR and CR improved by 37.5% and 44.2%, and advanced-task SR rose from ~20% to ~55% in one setting. Manual operation “knowledge injection” can push results near 100% on many basic tasks.

Problem Statement

Single-agent MLLM approaches struggle on mobile UI tasks for two reasons: (1) task progress navigation, meaning tracking which substeps are already done in a long, interleaved image+text history, and (2) focus content navigation, meaning retrieving task-relevant text/images from prior screens. Long sequence length and mixed modalities reduce single-agent effectiveness, causing wrong or ineffective taps.

Main Contribution

A three-role multi-agent architecture (planning, decision, reflection) that splits responsibilities to reduce context length and improve navigation.

A short-term memory unit that stores task-relevant focus content extracted from screens for later use.
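The division of labor described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AgentState`, `run_episode`, and every callable parameter (`observe`, `plan`, `decide`, `reflect`, `execute`, `extract_focus`) are hypothetical, and in a real system the three agent callables would wrap model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Hypothetical container for what is passed between the three roles."""
    instruction: str
    progress: str = ""                                # compact text summary kept by planning
    memory: List[str] = field(default_factory=list)   # short-term focus content
    last_action: str = ""                             # last operation judged correct


def run_episode(
    state: AgentState,
    observe: Callable[[], str],                          # returns a description of the current screen
    plan: Callable[[str, str, str], str],                # (instruction, progress, last_action) -> progress
    decide: Callable[[str, str, List[str], str], str],   # (instruction, progress, memory, screen) -> action
    reflect: Callable[[str, str, str], str],             # (before, after, action) -> verdict
    execute: Callable[[str], None],                      # performs the action on the device
    extract_focus: Callable[[str, str], List[str]],      # (instruction, screen) -> facts worth keeping
    max_steps: int = 10,
) -> AgentState:
    for _ in range(max_steps):
        before = observe()
        # Planning sees only text, so its input stays short as the episode grows.
        state.progress = plan(state.instruction, state.progress, state.last_action)
        # Decision sees the current screen plus the compact progress and memory.
        action = decide(state.instruction, state.progress, state.memory, before)
        if action == "stop":
            break
        execute(action)
        after = observe()
        # Reflection compares the screens before and after the operation.
        if reflect(before, after, action) == "correct":
            state.last_action = action
            state.memory.extend(extract_focus(state.instruction, after))
        # On any other verdict nothing is recorded, so the step is not counted as done.
    return state
```

Passing the roles in as callables keeps the loop testable with plain stubs before any model is wired in.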

Key Findings

Multi-agent design raises task completion vs single-agent Mobile-Agent.

Numbers: average SR +27% (across English/Chinese evals)

Practical Use: Use a planning/decision/reflection split to lift overall automation success on real devices.

Evidence Ref: 4.3.1 / Tables 1-2

Hard (advanced) instructions improved sharply under Mobile-Agent-v2.

Numbers: advanced SR ~55% vs ~20% (Mobile-Agent) in one non-English setting

Practical Use: For complex, multi-step UI tasks, multi-agent coordination yields materially better outcomes than end-to-end single-agent control.

Evidence Ref: 4.3.1 / Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Success Rate (SR) | average +27% vs Mobile-Agent | Mobile-Agent single-agent | +27% (avg across English & non-English) | dynamic real-device eval (88 instructions) | Paper states Mobile-Agent-v2 achieves an average improvement of 27% in success rate (4.3.1) | Tables 1-2 |
| Advanced-task Success Rate (SR) | 55% (Mobile-Agent-v2) vs 20% (Mobile-Agent) | Mobile-Agent single-agent | +35 percentage points | non-English advanced instructions | Text reports 55% vs 20% for advanced instructions (4.3.1) | Table 1 |
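The advanced-task delta is stated in percentage points, which is not the same as a relative improvement. The short sketch below works through both for the 55% vs 20% figures. The function names are illustrative. The +27% average is reported here without saying which of the two it is, so this arithmetic is applied only to the advanced-task row.

```python
def percentage_point_gain(new_pct: float, base_pct: float) -> float:
    """Absolute difference between two rates expressed in percent."""
    return new_pct - base_pct


def relative_gain_pct(new_pct: float, base_pct: float) -> float:
    """Improvement relative to the baseline, expressed in percent."""
    return (new_pct - base_pct) / base_pct * 100


print(percentage_point_gain(55.0, 20.0))  # 35.0 percentage points
print(relative_gain_pct(55.0, 20.0))      # 175.0 percent relative to the baseline
```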

What To Try In 7 Days

Build a simple three-role prototype: one module to summarize progress, one to act on screenshots, one to check outcomes.

Add a short-term memory that records task-relevant text and icons from each screen, and read it back before every action.

Run 20 representative multi-app tasks on a test phone and compare SR/CR to your current single-agent or scripted baseline.
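For the comparison step, a small scoring helper is enough to get started. The sketch below assumes SR is the fraction of tasks completed end to end and CR is the fraction of correct steps over all steps taken. These definitions are assumptions made for illustration and should be checked against the paper before results are compared with its tables. `TaskResult` and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    succeeded: bool      # did the whole task finish correctly?
    correct_steps: int   # steps judged correct by a human reviewer
    total_steps: int     # steps the agent actually took


def success_rate(results: Sequence[TaskResult]) -> float:
    """Assumed SR: share of tasks that succeeded end to end."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


def completion_rate(results: Sequence[TaskResult]) -> float:
    """Assumed CR: correct steps divided by all steps, pooled over tasks."""
    total = sum(r.total_steps for r in results)
    if total == 0:
        return 0.0
    return sum(r.correct_steps for r in results) / total


def compare(baseline: Sequence[TaskResult], candidate: Sequence[TaskResult]) -> Dict[str, float]:
    """Deltas in percentage points (candidate minus baseline)."""
    return {
        "sr_delta_pp": 100 * (success_rate(candidate) - success_rate(baseline)),
        "cr_delta_pp": 100 * (completion_rate(candidate) - completion_rate(baseline)),
    }
```

Running both systems on the same task list and logging one `TaskResult` per task keeps the comparison paired, which matters with only 20 tasks.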

Agent Features

Memory
short-term focus content memory (stores task-relevant screen facts)
Planning
task progress summarization (text-only planning agent)
Tool Use
OCR (ConvNextViT-document), icon detection (GroundingDINO), icon description (Qwen-VL-Int4), ADB for execution
Frameworks
Mobile-Agent-v2
Is Agentic

Yes

Architectures
planning-decision-reflection multi-agent
Collaboration
text-passing between agents (planning -> decision -> reflection)
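Since execution goes through ADB, the decision agent's output eventually has to become `adb shell input` commands. The sketch below builds those argument lists. The `input tap`, `input swipe`, `input text`, and `input keyevent` subcommands are standard Android shell commands, while the helper names and the idea of returning lists rather than running them directly are illustrative choices, not taken from the Mobile-Agent-v2 code.

```python
import subprocess
from typing import List, Optional


def adb_prefix(serial: Optional[str] = None) -> List[str]:
    """Base command; -s selects a device when several are attached."""
    return ["adb"] + (["-s", serial] if serial else [])


def tap_cmd(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    return adb_prefix(serial) + ["shell", "input", "tap", str(x), str(y)]


def swipe_cmd(x1: int, y1: int, x2: int, y2: int,
              duration_ms: int = 300, serial: Optional[str] = None) -> List[str]:
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return adb_prefix(serial) + ["shell", "input", "swipe"] + coords


def text_cmd(text: str, serial: Optional[str] = None) -> List[str]:
    # `input text` expects spaces encoded as %s; other shell-special
    # characters would need extra escaping, which this sketch does not do.
    return adb_prefix(serial) + ["shell", "input", "text", text.replace(" ", "%s")]


def key_cmd(keycode: str, serial: Optional[str] = None) -> List[str]:
    # Example keycodes: KEYCODE_BACK, KEYCODE_HOME
    return adb_prefix(serial) + ["shell", "input", "keyevent", keycode]


def run(cmd: List[str]) -> None:
    """Execute a built command; requires adb on PATH and a connected device."""
    subprocess.run(cmd, check=True)
```

Building the command separately from running it makes the action layer easy to log and to unit-test without a phone attached.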

Optimization Features

System Optimization
context reduction via planning agent

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on closed-source LLM APIs (GPT-4/GPT-4V) and external visual models.

Evaluation used a fixed set of 88 instructions; broader app diversity untested.

When Not To Use

When you must run fully on-device without cloud LLMs or network access.

If only trivial single-step UI tasks are needed (overhead not justified).

Failure Modes

Hallucinated or wrong-tap actions leading to unrelated pages.

Ineffective operations that cause no page change and must be retried.
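Both failure modes can be told apart with a cheap check before any model is consulted. The sketch below is a swapped-in heuristic, not the paper's reflection agent (which is model-based): it treats byte-identical screenshots as "no page change" and delegates the "is this the page we wanted?" question to a caller-supplied predicate. All names here are hypothetical.

```python
import hashlib
from typing import Callable


def fingerprint(screenshot: bytes) -> str:
    """Hash of the raw screenshot bytes, used to detect an unchanged screen."""
    return hashlib.sha256(screenshot).hexdigest()


def classify_outcome(before: bytes, after: bytes,
                     looks_expected: Callable[[bytes], bool]) -> str:
    """Return 'ineffective', 'erroneous', or 'correct' for one operation."""
    if fingerprint(before) == fingerprint(after):
        return "ineffective"   # the operation caused no page change
    if not looks_expected(after):
        return "erroneous"     # the page changed, but to something unrelated
    return "correct"


def next_step(outcome: str, retries_left: int) -> str:
    """Simple recovery policy for the two failure modes."""
    if outcome == "correct":
        return "continue"
    if outcome == "erroneous":
        return "go_back"       # e.g. return to the previous page, then re-decide
    return "retry" if retries_left > 0 else "give_up"
```

Exact byte equality is brittle on real devices, where a clock or an animation changes pixels between captures, so a production check would likely compare a cropped or downscaled region instead.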

Core Entities

Models

GPT-4, GPT-4V, Qwen-VL-Int4, ConvNextViT-document, GroundingDINO

Metrics

SR, CR, DA, RA

Benchmarks

dynamic real-device evaluation (88 instructions across apps and languages)

Context Entities

Models

GPT-4, GPT-4V, Gemini-1.5-Pro, Qwen-VL-Max

Metrics

SR, CR, DA, RA

Benchmarks

custom dynamic evaluation (system and external apps, multi-app tasks)