Split planning, decision and reflection agents to boost mobile UI automation and task completion

Overview

Decision Snapshot: Ready For Pilot

Splitting roles reduces input length and concentrates responsibilities; dynamic real-device tests show consistent gains, but results depend on closed APIs (GPT-4/GPT-4V), visual tool quality, and a limited 88-instruction benchmark.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Why It Matters For Business

Splitting UI automation into planning, decision, and reflection raises task success for real-device workflows, reducing manual scripting and improving coverage for multi-app automation and testing.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

Mobile-Agent-v2 replaces a single MLLM agent with three specialized agents (planning, decision, and reflection) plus a short-term memory unit and a visual perception module. The planning agent condenses the long interleaved image-text history into a compact task-progress summary, the decision agent chooses each action from the current screen, and the reflection agent checks the outcome and triggers retries. On dynamic real-device tests (88 instructions across system-app, external-app, and multi-app tasks) it raises success rates substantially over the single-agent Mobile-Agent: multi-app SR and CR improved by 37.5% and 44.2%, and advanced-task SR rose from ~20% to ~55% in one setting. Manual operation "knowledge injection" can push results near 100% on many basic tasks.

Problem Statement

Single-agent MLLM approaches struggle on mobile UI tasks for two reasons: (1) task progress navigation — tracking which substeps are already done in a long, interleaved image+text history, and (2) focus content navigation — retrieving task-relevant text/images from prior screens. Long sequence length and mixed modalities reduce single-agent effectiveness, causing wrong or ineffective taps.

Main Contribution

A three-role multi-agent architecture (planning, decision, reflection) that splits responsibilities to reduce context length and improve navigation.

A short-term memory unit that stores task-relevant focus content extracted from screens for later use.
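The division of labor above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `AgentState`, `run_episode`, `ToyPhone`, and the `toy_*` functions are hypothetical names, and each role is a plain callable standing in for an MLLM call. The point it shows is that the decision role only sees the text progress summary, the short-term memory, and the current screen, never the full screenshot history.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    instruction: str
    progress: str = ""  # text-only task progress, owned by the planning role
    memory: List[str] = field(default_factory=list)  # short-term focus content
    last_action: str = ""


def run_episode(
    state: AgentState,
    observe: Callable[[], str],
    execute: Callable[[str], None],
    plan: Callable[[AgentState], str],
    decide: Callable[[AgentState, str], str],
    reflect: Callable[[str, str, str], str],
    max_steps: int = 10,
) -> List[Tuple[str, str]]:
    """Run decide -> execute -> reflect -> plan until "stop" or max_steps."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        before = observe()
        action = decide(state, before)  # sees progress, memory, current screen
        if action == "stop":
            break
        execute(action)
        after = observe()
        verdict = reflect(before, after, action)
        history.append((action, verdict))
        if verdict == "ok":
            state.last_action = action
            state.progress = plan(state)  # fold the step into the summary
    return history


# --- Toy stand-ins (all hypothetical) so the loop runs without a phone ---
class ToyPhone:
    def __init__(self) -> None:
        self.page = "home"

    def observe(self) -> str:
        return self.page

    def execute(self, action: str) -> None:
        # Only one action has any effect in this toy environment.
        if action == "open notes" and self.page == "home":
            self.page = "notes"


def toy_decide(state: AgentState, screen: str) -> str:
    if screen == "home":
        return "open notes"
    state.memory.append(f"reached page: {screen}")  # save focus content
    return "stop"


def toy_plan(state: AgentState) -> str:
    return state.progress + state.last_action + "; "


def toy_reflect(before: str, after: str, action: str) -> str:
    # Simplest possible check: did the screen change at all?
    return "ok" if before != after else "retry"


phone = ToyPhone()
state = AgentState(instruction="Open the notes app")
history = run_episode(state, phone.observe, phone.execute,
                      toy_plan, toy_decide, toy_reflect)
print(history, state.progress, state.memory)
```

Keeping the roles as injected callables makes it straightforward to replace one role at a time with a real model call while the others stay stubbed.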

Key Findings

Multi-agent design raises task completion vs single-agent Mobile-Agent.

Numbers: average SR +27% (across English/Chinese evals)

Practical Use: Use a planning/decision/reflection split to lift overall automation success on real devices.

Evidence Ref: 4.3.1 / Tables 1-2

Hard (advanced) instructions improved sharply under Mobile-Agent-v2.

Numbers: advanced SR ~55% vs ~20% (Mobile-Agent) in one non-English setting

Practical Use: For complex, multi-step UI tasks, multi-agent coordination yields materially better outcomes than end-to-end single-agent control.

Evidence Ref: 4.3.1 / Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success Rate (SR)	average +27% vs Mobile-Agent	Mobile-Agent single-agent	+27% (avg across English & non-English)	dynamic real-device eval (88 instructions)	Paper states Mobile-Agent-v2 achieves an average improvement of 27% in success rate (4.3.1)	Tables 1-2
Advanced-task Success Rate (SR)	55% (Mobile-Agent-v2) vs 20% (Mobile-Agent)	Mobile-Agent single-agent	+35 percentage points	non-English advanced instructions	Text reports 55% vs 20% for advanced instructions (4.3.1)	Table 1
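
When reading deltas like the advanced-task row, it helps to keep percentage points and relative gain apart, since the two can differ widely. The short sketch below works the arithmetic for the 55% vs 20% figures from the table; the function names are illustrative only, and the inputs are the approximate values reported above.

```python
def point_delta(new_pct: float, old_pct: float) -> float:
    """Absolute change in percentage points."""
    return new_pct - old_pct


def relative_gain(new_pct: float, old_pct: float) -> float:
    """Change relative to the baseline, as a fraction of the baseline."""
    return (new_pct - old_pct) / old_pct


advanced_points = point_delta(55, 20)      # percentage points
advanced_relative = relative_gain(55, 20)  # fraction of the baseline
print(advanced_points, advanced_relative)
```

For these inputs the absolute change is 35 percentage points, while the relative gain over the baseline is 1.75 (that is, 175%).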

What To Try In 7 Days

Build a simple three-role prototype: one module to summarize progress, one to act on screenshots, one to check outcomes.

Add a short-term memory that stores task-relevant text and icons from screens, and read it before each action.

Run 20 representative multi-app tasks on a test phone and compare SR/CR to your current single-agent or scripted baseline.
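
For the comparison step, a small scoring helper is enough. The sketch below assumes SR is the share of instructions whose requirements were all met and CR is completed operations divided by ground-truth operations; check these against the paper's definitions before relying on them. `RunResult` and the sample numbers are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    instruction: str
    completed_ops: int  # operations the agent performed correctly
    total_ops: int      # ground-truth operations for this instruction
    success: bool       # True only if every requirement was met


def success_rate(runs: List[RunResult]) -> float:
    """Fraction of instructions completed end to end."""
    return sum(r.success for r in runs) / len(runs)


def completion_rate(runs: List[RunResult]) -> float:
    """Completed operations over all ground-truth operations."""
    return sum(r.completed_ops for r in runs) / sum(r.total_ops for r in runs)


# Made-up sample log for four tasks (illustrative only).
sample_runs = [
    RunResult("copy text between two apps", 3, 3, True),
    RunResult("share a photo to chat", 2, 4, False),
    RunResult("set an alarm from a message", 5, 5, True),
    RunResult("search and bookmark a page", 1, 4, False),
]
sr = success_rate(sample_runs)
cr = completion_rate(sample_runs)
print(f"SR={sr:.2%} CR={cr:.2%}")
```

Scoring the baseline and the three-role prototype with the same functions on the same task list keeps the comparison like for like.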

Agent Features

Memory

short-term focus content memory (stores task-relevant screen facts)

Planning

task progress summarization (text-only planning agent)

Tool Use

OCR (ConvNextViT-document), icon detection (GroundingDINO), icon description (Qwen-VL-Int4), ADB for execution
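
As a rough idea of what ADB-based execution involves, the sketch below wraps three standard `adb` invocations (`shell input tap`, `shell input swipe`, and `exec-out screencap -p`) with Python's `subprocess`. It is not taken from the Mobile-Agent repository; the helper names are hypothetical, and it assumes `adb` is on the PATH with a device attached when `run` or `screenshot` is called.

```python
import subprocess
from typing import List, Optional


def adb_cmd(args: List[str], serial: Optional[str] = None) -> List[str]:
    """Build an adb argument list, optionally targeting one device by serial."""
    base = ["adb"]
    if serial:
        base += ["-s", serial]
    return base + list(args)


def tap_cmd(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    return adb_cmd(["shell", "input", "tap", str(x), str(y)], serial)


def swipe_cmd(x1: int, y1: int, x2: int, y2: int,
              duration_ms: int = 300,
              serial: Optional[str] = None) -> List[str]:
    return adb_cmd(["shell", "input", "swipe",
                    str(x1), str(y1), str(x2), str(y2), str(duration_ms)],
                   serial)


def run(cmd: List[str]) -> None:
    """Execute a command built above; raises if adb exits non-zero."""
    subprocess.run(cmd, check=True)


def screenshot(path: str, serial: Optional[str] = None) -> None:
    """Save the current screen as a PNG file on the host."""
    result = subprocess.run(adb_cmd(["exec-out", "screencap", "-p"], serial),
                            check=True, capture_output=True)
    with open(path, "wb") as f:
        f.write(result.stdout)


# Building a command does not require a device, so it is safe to inspect.
print(tap_cmd(540, 1200))
```

Separating command construction from execution makes the action layer easy to unit test without hardware.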

Frameworks

Mobile-Agent-v2

Is Agentic

Yes

Architectures

planning-decision-reflection multi-agent

Collaboration

text-passing between agents (planning -> decision -> reflection)

Optimization Features

System Optimization

context reduction via planning agent

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/X-PLUG/MobileAgent

Risks & Boundaries

Limitations

Relies on closed-source LLM APIs (GPT-4/GPT-4V) and external visual models.

Evaluation used a fixed set of 88 instructions; broader app diversity untested.

When Not To Use

When you must run fully on-device without cloud LLMs or network access.

If only trivial single-step UI tasks are needed (overhead not justified).

Failure Modes

Hallucinated or wrong-tap actions leading to unrelated pages.

Ineffective operations that cause no page change and must be retried.
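
The two failure modes above suggest a simple triage step after every action. The sketch below is a keyword-based stand-in for that check, assuming screen contents are available as text (for example from OCR); in the system described here the judgment is made by a reflection agent, not by string matching. `Outcome`, `classify`, and `next_step` are hypothetical names.

```python
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    CORRECT = "correct"
    INEFFECTIVE = "ineffective"  # the page did not change
    ERRONEOUS = "erroneous"      # the action led to an unrelated page


def classify(before_text: str, after_text: str,
             expected_keywords: Iterable[str]) -> Outcome:
    """Heuristic outcome check on the text of the screen before and after."""
    if before_text == after_text:
        return Outcome.INEFFECTIVE
    lowered = after_text.lower()
    if not any(k.lower() in lowered for k in expected_keywords):
        return Outcome.ERRONEOUS
    return Outcome.CORRECT


def next_step(outcome: Outcome) -> str:
    """Map each outcome to a recovery policy (an assumed policy, for illustration)."""
    return {
        Outcome.CORRECT: "continue",
        Outcome.INEFFECTIVE: "retry",
        Outcome.ERRONEOUS: "go_back",
    }[outcome]


print(next_step(classify("Home", "Home", ["Settings"])))
print(next_step(classify("Home", "Weather today", ["Settings"])))
print(next_step(classify("Home", "Settings > Wi-Fi", ["Settings"])))
```

Logging the outcome per step also gives a direct count of how often each failure mode occurs in a pilot.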

Core Entities

Models

GPT-4, GPT-4V, Qwen-VL-Int4, ConvNextViT-document, GroundingDINO

Metrics

SR, CR, DA, RA

Benchmarks

dynamic real-device evaluation (88 instructions across apps and languages)

Context Entities

Models

GPT-4, GPT-4V, Gemini-1.5-Pro, Qwen-VL-Max

Metrics

SR, CR, DA, RA

Benchmarks

custom dynamic evaluation (system and external apps, multi-app tasks)
