LLM-powered multi-agent system automates WeChat Pay UAT and achieves 88.6% Pass@1

Overview

Decision Snapshot: Needs Validation

The system is demonstrated on 450 real production test cases and has been deployed in WeChat Pay, so readiness and cost impact are high; novelty is moderate because it combines existing LLM-agent ideas with engineering for GUI testing.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Zhitao Wang, Wei Wang, Zirao Li, Long Wang, Can Yi, Xinjie Xu, Luyang Cao, Hanjing Su, Shouzhi Chen, Jun Zhou

Why It Matters For Business

Automates the most labor-intensive step in UAT (script generation), cutting manual tester time and making daily regression testing faster and more consistent.

Who Should Care

CTO, Product Manager, Engineering Lead, ML Engineer

Summary TLDR

The authors build XUAT-Copilot: three LLM-based agents (operation, parameter selection, inspection) plus perception and rewriting modules that turn UAT steps into ADB commands for Android. On 450 real WeChat Pay test cases, the multi-agent system reached an 88.55% case-level pass rate (Pass@1) and a 93.03% step-level completion rate (Complete@1), far above a single-agent variant. The system is deployed in WeChat Pay's test environment and has reduced manual scripting work. Key ideas: rewrite terse test steps, filter the GUI view hierarchy, extract clickable bounds from screenshots, split work across specialized LLM agents, and use self-reflection to fix invalid actions.

Problem Statement

Turning semi-structured UAT steps into executable Android Debug Bridge (ADB) scripts is still manual and error-prone. The task: given a test case (ordered steps), a large parameter list, current GUI state (view hierarchy + screenshot) and a skill library (ADB-wrapping functions), generate a correct sequence of actions that follows each step exactly.

Main Contribution

Design of XUAT-Copilot, an LLM-powered multi-agent pipeline to generate UAT scripts from test-case steps and GUI state.

Practical modules: GUI perception (filtered view hierarchy + image-based hyperlink bounds), a rewriting step to clarify terse instructions, and a skill library of ADB-wrapped commands.

Key Findings

Multi-agent system greatly improves pass rates versus a single-agent LLM.

Numbers: Pass@1 of 88.55% vs 22.65% (Table 4)

Practical Use: Split planning, parameter choice, and inspection across agents to raise UAT script success; single monolithic prompts perform poorly.

Evidence Ref: Table 4

Adding self-reflection in prompts meaningfully increases accuracy.

Numbers: Pass@1 of 88.55% (with reflection) vs 81.96% (no reflection) (Table 4)

Practical Use: Include a reflection prompt that records invalid actions and asks the agent to correct them to boost reliability.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (case-level pass rate)	88.55%	Single Agent: 22.65%	+65.90pp vs Single Agent	450 WeChat Pay UAT cases	Table 4 reports method scores	Table 4
Complete@1 (step-level completion rate)	93.03%	Single Agent: 25.25%	+67.78pp vs Single Agent	450 WeChat Pay UAT cases	Table 4 reports method scores	Table 4
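The Delta column is an absolute difference in percentage points, not a relative improvement. A minimal sketch that recomputes it from the table values (the dictionary layout and the `delta_pp` helper are illustrative names, not from the paper):

```python
# Scores copied from the Results table above (reported in Table 4).
scores = {
    "Pass@1": {"multi_agent": 88.55, "single_agent": 22.65},
    "Complete@1": {"multi_agent": 93.03, "single_agent": 25.25},
}


def delta_pp(multi: float, single: float) -> float:
    """Absolute difference in percentage points, rounded to 2 decimals."""
    return round(multi - single, 2)


deltas = {
    name: delta_pp(s["multi_agent"], s["single_agent"])
    for name, s in scores.items()
}

for name, value in deltas.items():
    print(f"{name}: +{value:.2f}pp vs Single Agent")
```

Both recomputed values match the table (+65.90pp and +67.78pp).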

What To Try In 7 Days

Prototype a small multi-agent pipeline: separate planning, parameter selection, and state check prompts.

Collect a tiny set (50–100) of real UAT steps with view hierarchies and screenshots to test the pipeline end-to-end.

Add a simple reflection loop: log invalid actions and re-prompt the planner to correct them.
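The reflection loop in the last item can be prototyped without any LLM dependency by treating the planner as a plain callable. The sketch below is an assumption-laden illustration, not the authors' implementation: `run_step`, `StepResult`, `stub_planner`, and the action strings are all hypothetical, and the stub stands in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    action: Optional[str]          # accepted action, or None if all attempts failed
    attempts: int                  # number of planner calls made
    invalid_history: List[str] = field(default_factory=list)


def run_step(
    step: str,
    plan: Callable[[str, List[str]], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> StepResult:
    """Ask the planner for an action; on an invalid one, log it and re-prompt."""
    invalid: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # The planner receives the rejected actions so it can avoid repeating them.
        action = plan(step, list(invalid))
        if is_valid(action):
            return StepResult(action, attempt, invalid)
        invalid.append(action)
    return StepResult(None, max_attempts, invalid)


def stub_planner(step: str, invalid: List[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong on the first try, right after feedback."""
    return "tap_text('Pay')" if invalid else "press('Pay')"


result = run_step(
    "Click the Pay button",
    stub_planner,
    is_valid=lambda a: a.startswith("tap_text("),
)
print(result)
```

In a real prototype, `plan` would format the step, the perceived GUI state, and the invalid-action history into a prompt, and `is_valid` would check the proposed action against the skill library.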

Agent Features

Memory

working memory M (short-term conversation summary)

invalid-action history for reflection

Planning

LLM-based action planning

self-reflection correction (prompted)

Tool Use

ADB-wrapped skill library

OCR/text-detection (SegLink++, ConvNeXt)

Frameworks

ReAct-like prompting

chain-of-thought style decomposition

zero-shot in-context prompting

Is Agentic

Yes

Architectures

multi-agent

Collaboration

disordered cooperation (agents interact via shared prompts and memory)

Optimization Features

Token Efficiency

filtering view hierarchy to reduce tokens
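One way to picture view-hierarchy filtering is to keep only nodes that carry text or are clickable, and only a few attributes of each. The rule below is an assumption for illustration; the summary does not describe the paper's exact filter. The sample XML imitates the attribute names of an Android UI hierarchy dump, and `filter_hierarchy` and `KEEP_ATTRS` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Tiny hand-written sample in the style of an Android UI hierarchy dump.
SAMPLE = """<hierarchy>
  <node class="android.widget.FrameLayout" text="" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.TextView" text="Amount" clickable="false" bounds="[40,200][300,260]"/>
    <node class="android.widget.Button" text="Pay" clickable="true" bounds="[40,300][1040,420]"/>
    <node class="android.view.View" text="" clickable="false" bounds="[0,500][1080,600]"/>
  </node>
</hierarchy>"""

# Attributes retained per node (an assumed selection).
KEEP_ATTRS = ("class", "text", "clickable", "bounds")


def filter_hierarchy(xml_text: str) -> List[Dict[str, str]]:
    """Keep nodes that have visible text or are clickable; drop everything else."""
    root = ET.fromstring(xml_text)
    kept = []
    for node in root.iter("node"):
        has_text = bool(node.get("text", "").strip())
        clickable = node.get("clickable") == "true"
        if has_text or clickable:
            kept.append({k: node.get(k, "") for k in KEEP_ATTRS})
    return kept


widgets = filter_hierarchy(SAMPLE)
for w in widgets:
    print(w)
```

On this sample, the two layout-only nodes are dropped and the remaining two widgets are what would be serialized into the prompt.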

System Optimization

skill library of ADB wrappers to limit allowed commands
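A skill library of this kind can be sketched as a registry that maps a small set of skill names to functions building `adb` argument lists, so the agent can only request commands that exist in the registry. The skill names and the `build_command` helper are hypothetical; the underlying `adb shell input tap`, `input text`, and `input keyevent` commands are standard. The sketch only builds commands and does not run them.

```python
from typing import Callable, Dict, List


def tap(x: int, y: int) -> List[str]:
    """Tap at screen coordinates."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def input_text(text: str) -> List[str]:
    """Type text; `input text` expects spaces to be written as %s."""
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]


def press_back() -> List[str]:
    """Press the Android back key."""
    return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]


# The registry is the whitelist: anything not listed here cannot be executed.
SKILLS: Dict[str, Callable[..., List[str]]] = {
    "tap": tap,
    "input_text": input_text,
    "press_back": press_back,
}


def build_command(skill: str, *args) -> List[str]:
    """Resolve a skill name to an adb argv list, rejecting unknown skills."""
    if skill not in SKILLS:
        raise KeyError(f"unknown skill: {skill!r}")
    return SKILLS[skill](*args)


print(build_command("tap", 540, 360))
print(build_command("input_text", "hello world"))
```

With a device attached, each returned list could be passed to `subprocess.run`; keeping command construction separate from execution makes the whitelist easy to test offline.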

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Relies on an internal parameter list and many rule-based rewriting parts, limiting portability.

Results reported only on WeChat Pay cases; generalization to other apps is untested.

When Not To Use

When you lack structured view hierarchy or reliable screenshots for the app under test.

For security- or privacy-sensitive flows where automated inputs could leak secrets.

Failure Modes

The LLM hallucinates invalid or non-existent skill names, which leads to execution errors.

Wrong parameter selection from large parameter lists causing incorrect inputs.
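The first failure mode can be caught before execution by parsing the action the model emits and checking it against the known skills. The sketch below assumes, hypothetically, that actions arrive as single call expressions such as `tap(10, 20)`; `ALLOWED` and `check_action` are illustrative names, and the returned message is the kind of feedback a reflection prompt could include. It does not address wrong parameter selection, which needs a semantic check.

```python
import ast
from typing import Optional, Tuple

# Hypothetical skill names mapped to their expected positional-argument counts.
ALLOWED = {"tap": 2, "input_text": 1, "press_back": 0}


def check_action(action: str) -> Tuple[bool, Optional[str]]:
    """Return (ok, error). The action is parsed, never evaluated."""
    try:
        tree = ast.parse(action.strip(), mode="eval")
    except SyntaxError:
        return False, "action is not a valid expression"
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False, "action must be a single call like tap(10, 20)"
    name = call.func.id
    if name not in ALLOWED:
        return False, f"unknown skill {name!r}; allowed: {sorted(ALLOWED)}"
    if call.keywords or len(call.args) != ALLOWED[name]:
        return False, f"{name} expects {ALLOWED[name]} positional argument(s)"
    if not all(isinstance(arg, ast.Constant) for arg in call.args):
        return False, "arguments must be literal values"
    return True, None


print(check_action("tap(10, 20)"))
print(check_action("swipe_up()"))
```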

Core Entities

Models

GPT-3.5/GPT-4 (backbone LLMs referenced; exact model for experiments not specified)

LLaMA (referenced)

Metrics

Pass@1 (case-level pass rate)

Complete@1 (step-level completion rate)

Datasets

WeChat Pay UAT test cases (450 cases, internal dataset)
