Split planning, decision and reflection agents to boost mobile UI automation and task completion

June 3, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Links

Abstract / PDF

Why It Matters For Business

Splitting UI automation into planning, decision, and reflection raises task success for real-device workflows, reducing manual scripting and improving coverage for multi-app automation and testing.

Summary TLDR

Mobile-Agent-v2 replaces a single MLLM agent with three specialized agents (planning, decision, and reflection) plus a short-term memory and a visual perception module. The planning agent condenses the long interleaved image-text history into compact task progress, the decision agent chooses screen-aware operations, and the reflection agent checks outcomes and triggers retries. On real-device dynamic tests (88 instructions across system, external, and multi-app tasks) it raises success rate (SR) by about 27% on average over the single-agent Mobile-Agent; multi-app SR and completion rate (CR) improved by 37.5% and 44.2%, and advanced-task SR rose from ~20% to ~55% in one setting. Injecting manual operation knowledge can push results near 100% on many basic tasks.

Problem Statement

Single-agent MLLM approaches struggle on mobile UI tasks for two reasons: (1) task progress navigation — tracking which substeps are already done in a long, interleaved image+text history, and (2) focus content navigation — retrieving task-relevant text/images from prior screens. Long sequence length and mixed modalities reduce single-agent effectiveness, causing wrong or ineffective taps.

Main Contribution

A three-role multi-agent architecture (planning, decision, reflection) that splits responsibilities to reduce context length and improve navigation.

A short-term memory unit that stores task-relevant focus content extracted from screens for later use.

A reflection agent that compares before/after screenshots to detect erroneous or ineffective actions and triggers retries.

A practical visual perception module (OCR and icon detection) and a dynamic real-device evaluation across OSes and languages.
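The division of labor described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `EpisodeState`, `run_episode`, and the five callables (`plan`, `decide`, `reflect`, `observe`, `execute`) are hypothetical names, and in a real system the first three would wrap model calls while the last two would talk to the device. The sketch assumes that an action the reflection role rejects is simply not recorded as progress.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EpisodeState:
    """Hypothetical container for what the three roles share."""
    instruction: str
    progress: str = ""                                   # text-only task progress (planning role)
    memory: List[str] = field(default_factory=list)      # short-term focus content
    completed: List[str] = field(default_factory=list)   # actions the reflection role accepted


def run_episode(
    instruction: str,
    plan: Callable[[str, Optional[str]], str],
    decide: Callable[[str, str, List[str], object], str],
    reflect: Callable[[object, object, str], str],
    observe: Callable[[], object],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> EpisodeState:
    state = EpisodeState(instruction)
    last_ok: Optional[str] = None
    for _ in range(max_steps):
        before = observe()
        # Planning: fold the last accepted action into a compact text summary,
        # so the decision role never sees the full screenshot history.
        state.progress = plan(state.progress, last_ok)
        last_ok = None
        # Decision: choose an action from progress, memory, and the current screen.
        # The decide callable may also append focus content to state.memory.
        action = decide(instruction, state.progress, state.memory, before)
        if action == "stop":
            break
        execute(action)
        after = observe()
        # Reflection: accept the action only if the outcome looks right.
        if reflect(before, after, action) == "ok":
            state.completed.append(action)
            last_ok = action
    return state
```

Passing the roles in as plain callables keeps the loop testable with stubs before any model or device is wired in.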

Key Findings

Multi-agent design raises task completion vs single-agent Mobile-Agent.

Numbers: average SR +27% (across English/Chinese evals)

Hard (advanced) instructions improved sharply under Mobile-Agent-v2.

Numbers: advanced SR ~55% vs ~20% (Mobile-Agent) in one non-English setting

Multi-app scenarios benefit the most from multi-agent memory + reflection.

Numbers: multi-app SR +37.5%, CR +44.2% vs Mobile-Agent

Manual operation knowledge (hints/tutorials) can push performance to near-perfect on many tasks.

Numbers: basic SR up to 100% and large gains on advanced tasks when injected

Agent role design matters: planning agent is most impactful.

Numbers: in the ablation, removing the planning agent reduces basic SR from ~88.6 to ~59.1 (Table 4)

Results

Success Rate (SR)

Value: average +27% vs Mobile-Agent

Baseline: Mobile-Agent single-agent

Advanced-task Success Rate (SR)

Value: 55% (Mobile-Agent-v2) vs 20% (Mobile-Agent)

Baseline: Mobile-Agent single-agent

Multi-app SR / CR

Value: SR +37.5%, CR +44.2%

Baseline: Mobile-Agent single-agent

Accuracy

Value: GPT-4V with agent SR 92.7 (basic) vs GPT-4V without agent 2.7

Baseline: GPT-4V end-to-end

Who Should Care

What To Try In 7 Days

Build a simple three-role prototype: one module to summarize progress, one to act on screenshots, one to check outcomes.

Add a short-term memory that stores task-relevant text and icons from screens, and read it before each action.

Run 20 representative multi-app tasks on a test phone and compare SR/CR to your current single-agent or scripted baseline.
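For the comparison in the last step, a small scoring helper may be enough. The sketch below assumes SR is the fraction of instructions fully completed and CR is the share of ground-truth steps completed correctly across all tasks; check these definitions against the paper before reporting numbers. `TaskResult` and both functions are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TaskResult:
    """One instruction's outcome (hypothetical record format)."""
    name: str
    steps_done: int      # correct steps the agent completed
    steps_total: int     # ground-truth steps a human needs
    succeeded: bool      # whole instruction completed


def success_rate(results: Iterable[TaskResult]) -> float:
    items: List[TaskResult] = list(results)
    if not items:
        raise ValueError("no results to score")
    return sum(r.succeeded for r in items) / len(items)


def completion_rate(results: Iterable[TaskResult]) -> float:
    items: List[TaskResult] = list(results)
    total = sum(r.steps_total for r in items)
    if total == 0:
        raise ValueError("no ground-truth steps to score")
    return sum(r.steps_done for r in items) / total
```

Scoring the baseline and the prototype on the same task list with the same helper keeps the two numbers comparable.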

Agent Features

Memory

  • short-term focus content memory (stores task-relevant screen facts)

Planning

  • task progress summarization (text-only planning agent)

Tool Use

  • OCR (ConvNextViT-document)
  • icon detection (GroundingDINO)
  • icon description (Qwen-VL-Int4)
  • ADB for execution
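As an illustration of the execution layer, the sketch below builds the ADB command lines an agent's actions might map to. The `adb shell input tap`, `input swipe`, `input keyevent`, and `adb exec-out screencap -p` commands are standard ADB usage; the helper names and the action set are assumptions for this example and are not taken from the Mobile-Agent-v2 code.

```python
import subprocess
from typing import List, Optional


def adb_base(serial: Optional[str] = None) -> List[str]:
    # "-s <serial>" selects a device when several are attached.
    return ["adb"] + (["-s", serial] if serial else [])


def tap_cmd(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    return adb_base(serial) + ["shell", "input", "tap", str(x), str(y)]


def swipe_cmd(x1: int, y1: int, x2: int, y2: int,
              duration_ms: int = 300, serial: Optional[str] = None) -> List[str]:
    coords = [str(v) for v in (x1, y1, x2, y2)]
    return adb_base(serial) + ["shell", "input", "swipe", *coords, str(duration_ms)]


def back_cmd(serial: Optional[str] = None) -> List[str]:
    return adb_base(serial) + ["shell", "input", "keyevent", "KEYCODE_BACK"]


def screenshot_cmd(serial: Optional[str] = None) -> List[str]:
    # exec-out streams the PNG bytes to stdout without newline mangling.
    return adb_base(serial) + ["exec-out", "screencap", "-p"]


def run(cmd: List[str]) -> bytes:
    """Run a command built above; needs adb on PATH and a connected device."""
    return subprocess.run(cmd, check=True, capture_output=True).stdout
```

Separating command construction from `run` lets the mapping be unit-tested without a phone attached; for example, `run(screenshot_cmd())` would return PNG bytes on a connected device.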

Frameworks

  • Mobile-Agent-v2

Is Agentic

true

Architectures

  • planning-decision-reflection multi-agent

Collaboration

  • text-passing between agents (planning -> decision -> reflection)

Optimization Features

System Optimization

  • context reduction via planning agent

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on closed-source LLM APIs (GPT-4/GPT-4V) and external visual models.
  • Evaluation used a fixed set of 88 instructions; broader app diversity untested.
  • Performance still benefits from manual knowledge injection, indicating brittleness on some steps.

When Not To Use

  • When you must run fully on-device without cloud LLMs or network access.
  • If only trivial single-step UI tasks are needed (overhead not justified).
  • When you lack reliable screen OCR or icon detection for your UI styles.

Failure Modes

  • Hallucinated or wrong-tap actions leading to unrelated pages.
  • Ineffective operations that cause no page change and must be retried.
  • Memory can store incorrect focus content if perception errs.
  • High API cost and latency when using large cloud LLMs for each agent call.
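The ineffective-operation failure mode can sometimes be caught with a cheap local check before any model call, which also helps with the cost concern. The sketch below is a stand-in and not the paper's reflection agent (which uses an MLLM): it treats byte-identical before/after screenshots as "no page change" and retries within a budget. `screen_digest` and `act_with_retries` are hypothetical names, and exact byte equality is an assumption that breaks on screens with clocks or animations.

```python
import hashlib
from typing import Callable


def screen_digest(png_bytes: bytes) -> str:
    """Fingerprint a screenshot so two captures can be compared cheaply."""
    return hashlib.sha256(png_bytes).hexdigest()


def act_with_retries(
    action: str,
    execute: Callable[[str], None],
    observe: Callable[[], bytes],
    max_retries: int = 2,
) -> int:
    """Execute an action until the screen changes; return the attempts used.

    Raises RuntimeError if the screen never changes within the retry budget.
    """
    for attempt in range(1, max_retries + 2):
        before = screen_digest(observe())
        execute(action)
        after = screen_digest(observe())
        if before != after:
            return attempt
    raise RuntimeError(f"action {action!r} had no visible effect")
```

A changed screen only shows that something happened, not that the right thing happened, so wrong-tap detection would still need a model-based or rule-based check on top.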

Core Entities

Models

  • GPT-4
  • GPT-4V
  • Qwen-VL-Int4
  • ConvNextViT-document
  • GroundingDINO

Metrics

  • SR
  • CR
  • DA
  • RA

Benchmarks

  • dynamic real-device evaluation (88 instructions across apps and languages)

Context Entities

Models

  • GPT-4
  • GPT-4V
  • Gemini-1.5-Pro
  • Qwen-VL-Max

Metrics

  • SR
  • CR
  • DA
  • RA

Benchmarks

  • custom dynamic evaluation (system and external apps, multi-app tasks)