LLM-powered multi-agent system automates WeChat Pay UAT and achieves 88.6% Pass@1

January 5, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The system was evaluated on 450 real production test cases and is deployed in WeChat Pay's test environment, so production readiness and cost impact are rated high. Novelty is moderate: it combines existing LLM-agent ideas with engineering work specific to GUI testing.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Zhitao Wang, Wei Wang, Zirao Li, Long Wang, Can Yi, Xinjie Xu, Luyang Cao, Hanjing Su, Shouzhi Chen, Jun Zhou

Why It Matters For Business

Automates the most labor-intensive step in UAT (script generation), cutting manual tester time and making daily regression testing faster and more consistent.

Who Should Care

Summary TLDR

The authors build XUAT-Copilot: three LLM-based agents (operation, parameter selection, inspection) plus perception and rewriting modules that turn UAT steps into ADB commands for Android. On 450 real WeChat Pay test cases, the multi-agent system reached an 88.55% case pass rate (Pass@1) and a 93.03% step completion rate (Complete@1), far above the single-agent variant (22.65% and 25.25%). The system is deployed in WeChat Pay's test environment, where it has reduced manual scripting work. Key ideas: rewrite terse test steps, filter the GUI view hierarchy, extract clickable bounds from screenshots, split work across specialized LLM agents, and use self-reflection to fix invalid actions.

Problem Statement

Turning semi-structured UAT steps into executable Android Debug Bridge (ADB) scripts is still manual and error-prone. The task: given a test case (ordered steps), a large parameter list, current GUI state (view hierarchy + screenshot) and a skill library (ADB-wrapping functions), generate a correct sequence of actions that follows each step exactly.
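The shape of the task can be sketched as a few plain data types. This is a minimal illustration, not the authors' interface: the class names (`UatCase`, `GuiState`, `Action`), the `ScriptGenerator` signature, and the `tap_text` skill name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GuiState:
    view_hierarchy: str    # XML dump of the current screen
    screenshot_png: bytes  # raw screenshot bytes


@dataclass
class UatCase:
    steps: List[str]            # ordered, semi-structured step descriptions
    parameters: Dict[str, str]  # large parameter list (name -> value)


@dataclass
class Action:
    skill: str                  # name of a function in the skill library
    args: List[str] = field(default_factory=list)


# Hypothetical signature: one step + parameters + GUI state -> actions.
ScriptGenerator = Callable[[str, Dict[str, str], GuiState], List[Action]]


def stub_generator(step: str, params: Dict[str, str], gui: GuiState) -> List[Action]:
    # Placeholder only: a real generator would query the LLM agents here.
    return [Action(skill="tap_text", args=[step])]


case = UatCase(steps=["Tap Pay"], parameters={"amount": "1.00"})
gui = GuiState(view_hierarchy="<hierarchy/>", screenshot_png=b"")
script = [a for s in case.steps for a in stub_generator(s, case.parameters, gui)]
```

In a real run the GUI state would be re-captured after every executed action, since each step's correct action depends on the screen the previous one produced.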

Main Contribution

Design of XUAT-Copilot, an LLM-powered multi-agent pipeline to generate UAT scripts from test-case steps and GUI state.

Practical modules: GUI perception (filtered view hierarchy + image-based hyperlink bounds), a rewriting step to clarify terse instructions, and a skill library of ADB-wrapped commands.
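As a rough illustration of the view-hierarchy filtering idea, the sketch below drops layout-only nodes from a UI dump and keeps a handful of attributes. It assumes the XML layout produced by Android's `uiautomator dump` (nested `node` elements with `text`, `content-desc`, `clickable`, and `bounds` attributes); the keep/drop rule and the attribute list are assumptions, not the paper's actual filter.

```python
import xml.etree.ElementTree as ET

# Assumed set of attributes worth showing to the LLM.
KEEP_ATTRS = ("text", "content-desc", "resource-id", "clickable", "bounds")


def filter_view_hierarchy(xml_dump: str) -> list:
    """Return compact dicts for nodes that carry a label or are clickable."""
    root = ET.fromstring(xml_dump)
    kept = []
    for node in root.iter("node"):
        has_label = bool(node.get("text") or node.get("content-desc"))
        is_clickable = node.get("clickable") == "true"
        if not (has_label or is_clickable):
            continue  # layout-only container: drop it to save tokens
        # Keep only non-empty attributes from the assumed list.
        kept.append({k: node.get(k) for k in KEEP_ATTRS if node.get(k)})
    return kept


SAMPLE = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.TextView" text="Amount" clickable="false" bounds="[40,200][400,260]"/>
    <node class="android.widget.Button" text="Pay" clickable="true" bounds="[40,300][1040,420]"/>
  </node>
</hierarchy>"""

compact = filter_view_hierarchy(SAMPLE)
```

On the sample, the outer `FrameLayout` is discarded and only the two labelled widgets remain, which is the kind of reduction that keeps the prompt within a token budget.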

Key Findings

Multi-agent system greatly improves pass rates versus a single-agent LLM.

Numbers: Pass@1: 88.55% vs 22.65% (Table 4)

Practical Use: Split planning, parameter choice, and inspection across agents to raise UAT script success; single monolithic prompts perform poorly.

Evidence Ref: Table 4

Adding self-reflection in prompts meaningfully increases accuracy.

Numbers: Pass@1: 88.55% (with reflection) vs 81.96% (no reflection) (Table 4)

Practical Use: Include a reflection prompt that records invalid actions and asks the agent to correct them to boost reliability.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (case-level pass rate) | 88.55% | Single Agent: 22.65% | +65.90pp vs Single Agent | 450 WeChat Pay UAT cases | Table 4 reports method scores | Table 4
Complete@1 (step-level completion rate) | 93.03% | Single Agent: 25.25% | +67.78pp vs Single Agent | 450 WeChat Pay UAT cases | Table 4 reports method scores | Table 4
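The deltas are absolute differences in percentage points, not relative gains. The short check below recomputes them from the reported scores; the ratio it also prints is derived here for context and is not a figure reported in the source.

```python
# Scores copied from the results above (percent).
results = {
    "Pass@1": {"multi_agent": 88.55, "single_agent": 22.65},
    "Complete@1": {"multi_agent": 93.03, "single_agent": 25.25},
}


def delta_pp(row: dict) -> float:
    # Absolute difference in percentage points; rounded to hide float noise.
    return round(row["multi_agent"] - row["single_agent"], 2)


def ratio(row: dict) -> float:
    # Relative factor (multi-agent score divided by single-agent score).
    return round(row["multi_agent"] / row["single_agent"], 2)


deltas = {name: delta_pp(row) for name, row in results.items()}
ratios = {name: ratio(row) for name, row in results.items()}

for name in results:
    print(f"{name}: +{deltas[name]:.2f}pp, x{ratios[name]:.2f}")
```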

What To Try In 7 Days

Prototype a small multi-agent pipeline: separate planning, parameter selection, and state check prompts.

Collect a tiny set (50–100) of real UAT steps with view hierarchies and screenshots to test the pipeline end-to-end.

Add a simple reflection loop: log invalid actions and re-prompt the planner to correct them.
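A prototype along these lines might start from the sketch below: three separate prompts (planning, parameter selection, state check) and a reflection loop that feeds invalid answers back to the planner. Everything here is an assumption made for illustration, including the prompt wording, the `LLM` callable type, the skill names, and the `scripted_llm` stand-in used so the example runs offline.

```python
from typing import Callable, Dict, List, Optional, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

ALLOWED_SKILLS = {"tap", "input_text", "back"}  # hypothetical skill names


def operation_agent(llm: LLM, step: str, gui: str, invalid: List[str]) -> str:
    """Ask the planner for one skill name, showing it earlier invalid picks."""
    prompt = f"Step: {step}\nGUI: {gui}\nSkills: {sorted(ALLOWED_SKILLS)}\n"
    if invalid:
        prompt += f"These earlier answers were invalid, do not repeat them: {invalid}\n"
    return llm(prompt).strip()


def parameter_agent(llm: LLM, step: str, params: Dict[str, str]) -> str:
    """Use a separate prompt to pick the parameter name the step needs."""
    return llm(f"Step: {step}\nChoose one key from: {sorted(params)}").strip()


def inspection_agent(llm: LLM, step: str, gui_after: str) -> bool:
    """Use a third prompt to judge whether the step's goal was reached."""
    reply = llm(f"Step: {step}\nGUI now: {gui_after}\nDone? Answer yes or no.")
    return reply.strip().lower() == "yes"


def plan_with_reflection(
    llm: LLM, step: str, gui: str, max_retries: int = 3
) -> Tuple[Optional[str], List[str]]:
    """Re-prompt the planner until it names an allowed skill or retries run out."""
    invalid: List[str] = []
    for _ in range(max_retries):
        skill = operation_agent(llm, step, gui, invalid)
        if skill in ALLOWED_SKILLS:
            return skill, invalid
        invalid.append(skill)  # logged and shown to the planner next round
    return None, invalid


def scripted_llm(replies: List[str]) -> LLM:
    """Fake LLM for offline testing: returns the given replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


planner = scripted_llm(["click_button", "tap"])
chosen, history = plan_with_reflection(planner, "Tap the Pay button", "Button text=Pay")
param = parameter_agent(scripted_llm(["amount"]), "Enter the amount", {"amount": "1.00"})
done = inspection_agent(scripted_llm(["Yes"]), "Tap the Pay button", "Payment complete")
```

Swapping `scripted_llm` for a real model client is the only change needed to try this against collected UAT steps; keeping the three prompts separate makes it easy to see which agent is responsible when a case fails.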

Agent Features

Memory
working memory M (short-term conversation summary); invalid-action history for reflection
Planning
LLM-based action planning; self-reflection correction (prompted)
Tool Use
ADB-wrapped skill library; OCR/text-detection (SegLink++, ConvNeXt)
Frameworks
ReAct-like prompting; chain-of-thought style decomposition; zero-shot in-context prompting
Is Agentic

Yes

Architectures
multi-agent
Collaboration
disordered cooperation (agents interact via shared prompts and memory)

Optimization Features

Token Efficiency
filtering view hierarchy to reduce tokens
System Optimization
skill library of ADB wrappers to limit allowed commands
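One way to picture a skill library that limits allowed commands is a registry of small functions, each of which builds a fixed `adb shell input ...` command. The sketch below is an assumption about structure, not the paper's library: the skill names and the `build_command` dispatcher are hypothetical, and the commands are only constructed, never executed.

```python
from typing import Callable, Dict, List

SKILLS: Dict[str, Callable[..., List[str]]] = {}


def skill(fn: Callable[..., List[str]]) -> Callable[..., List[str]]:
    """Register a function as an allowed skill under its own name."""
    SKILLS[fn.__name__] = fn
    return fn


@skill
def tap(x: int, y: int) -> List[str]:
    return ["adb", "shell", "input", "tap", str(x), str(y)]


@skill
def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    return ["adb", "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2), str(duration_ms)]


@skill
def press_back() -> List[str]:
    return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]


def build_command(name: str, *args) -> List[str]:
    """Reject any skill name the registry does not know, then build argv."""
    if name not in SKILLS:
        raise ValueError(f"unknown skill {name!r}; allowed: {sorted(SKILLS)}")
    return SKILLS[name](*args)


cmd = build_command("tap", 540, 360)
```

Because the model can only name a registered skill, a hallucinated name fails at `build_command` with a clear error instead of reaching the device. The returned argument list could be handed to `subprocess.run(cmd, check=True)` on a machine with `adb` installed.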

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Relies on an internal parameter list and many rule-based rewriting components, which limits portability.

Results reported only on WeChat Pay cases; generalization to other apps is untested.

When Not To Use

When you lack structured view hierarchy or reliable screenshots for the app under test.

For security- or privacy-sensitive flows where automated inputs could leak secrets.

Failure Modes

The LLM may hallucinate invalid or nonexistent skill names, leading to execution errors.

The agent may select the wrong parameter from a large parameter list, causing incorrect inputs.

Core Entities

Models

GPT-3.5/GPT-4 (backbone LLMs referenced; exact model for experiments not specified)

LLaMA (referenced)

Metrics

Pass@1 (case-level pass rate)

Complete@1 (step-level completion rate)
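The two metrics can be illustrated on toy data. The sketch assumes that a case passes only if every one of its steps completes on the single attempt, and that Complete@1 is pooled over all steps in all cases; the source gives only the one-line descriptions above, so the exact aggregation is an assumption.

```python
from typing import List

# Each inner list holds one case's per-step outcomes from a single attempt.
Outcomes = List[List[bool]]


def pass_at_1(cases: Outcomes) -> float:
    """Percent of cases in which every step completed."""
    passed = sum(1 for steps in cases if steps and all(steps))
    return 100.0 * passed / len(cases)


def complete_at_1(cases: Outcomes) -> float:
    """Percent of steps completed, pooled over all cases (assumed aggregation)."""
    total = sum(len(steps) for steps in cases)
    done = sum(sum(steps) for steps in cases)
    return 100.0 * done / total


toy = [
    [True, True, True],    # passes
    [True, False, False],  # fails at step 2
    [True, True],          # passes
]
case_rate = pass_at_1(toy)
step_rate = complete_at_1(toy)
```

The toy data shows why the step-level number sits above the case-level one: a failed case still contributes the steps it completed before failing.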

Datasets

WeChat Pay UAT test cases (450 cases, internal dataset)