LLM-powered multi-agent system automates WeChat Pay UAT and achieves 88.6% Pass@1

January 5, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Zhitao Wang, Wei Wang, Zirao Li, Long Wang, Can Yi, Xinjie Xu, Luyang Cao, Hanjing Su, Shouzhi Chen, Jun Zhou

Why It Matters For Business

Automates the most labor-intensive step in UAT (script generation), cutting manual tester time and making daily regression testing faster and more consistent.

Summary TLDR

The authors build XUAT-Copilot: three LLM-based agents (operation, parameter selection, inspection) plus perception and rewriting modules that turn UAT steps into ADB commands for Android. On 450 real WeChat Pay test cases, the multi-agent system reached an 88.55% case pass rate (Pass@1) and 93.03% step completion (Complete@1), against 22.65% and 25.25% for a single-agent variant. The system is deployed in WeChat Pay's test environment, where it has reduced manual scripting work. Key ideas: rewrite terse test steps, filter the GUI view hierarchy, extract clickable bounds from screenshots, split work across specialized LLM agents, and use self-reflection to fix invalid actions.

Problem Statement

Turning semi-structured UAT steps into executable Android Debug Bridge (ADB) scripts is still manual and error-prone. The task: given a test case (ordered steps), a large parameter list, current GUI state (view hierarchy + screenshot) and a skill library (ADB-wrapping functions), generate a correct sequence of actions that follows each step exactly.
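The inputs listed in the task statement can be pictured as a handful of plain data types. The sketch below is only an illustration of that shape: every class and field name (`UatCase`, `GuiState`, `Action`, `TaskInput`, and so on) is hypothetical and does not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GuiState:
    """Snapshot of the screen: raw view hierarchy plus a screenshot path."""
    view_hierarchy_xml: str
    screenshot_path: str


@dataclass
class UatCase:
    """One test case: an ordered list of semi-structured UAT steps."""
    case_id: str
    steps: List[str]


@dataclass
class Action:
    """One call into the skill library (hypothetical shape)."""
    skill: str
    args: Dict[str, object] = field(default_factory=dict)


@dataclass
class TaskInput:
    """Everything the generator sees while working on a single step."""
    test_case: UatCase
    step_index: int
    parameters: Dict[str, str]                    # the large parameter list
    gui_state: GuiState
    skills: Dict[str, Callable[..., List[str]]]   # skill name -> ADB wrapper

    @property
    def current_step(self) -> str:
        # The step the generated actions must follow exactly.
        return self.test_case.steps[self.step_index]
```

The expected output for each step is then a list of `Action` values whose `skill` names all exist in `skills`.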

Main Contribution

Design of XUAT-Copilot, an LLM-powered multi-agent pipeline to generate UAT scripts from test-case steps and GUI state.

Practical modules: GUI perception (filtered view hierarchy + image-based hyperlink bounds), a rewriting step to clarify terse instructions, and a skill library of ADB-wrapped commands.

Empirical demonstration on 450 real WeChat Pay UAT cases, showing large gains over single-agent baselines, plus deployment in WeChat Pay's test environment.

Key Findings

The multi-agent system raises Pass@1 by roughly 66 percentage points over a single-agent LLM.

Numbers: Pass@1 88.55% vs 22.65% (Table 4)

Adding self-reflection to the prompts raises Pass@1 by about 6.6 percentage points.

Numbers: Pass@1 88.55% (with reflection) vs 81.96% (no reflection) (Table 4)

The method was tested on real production test data and deployed in WeChat Pay's test environment.

Numbers: 450 test cases; on average 7 steps and 15 actions per case (Table 3)

Results

Pass@1 (case-level pass rate)

Value: 88.55%

Baseline: Single Agent, 22.65%

Complete@1 (step-level completion rate)

Value: 93.03%

Baseline: Single Agent, 25.25%

Effect of reflection ablation

Value: Pass@1 81.96%, Complete@1 89.39%

Baseline: XUAT-Copilot, 88.55% / 93.03%
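For readers who want to track the same two metrics on their own runs, here is a minimal sketch. It assumes one reading of the labels above: Pass@1 is the share of cases in which every step succeeded on the first attempt, and Complete@1 is the share of all steps that succeeded. This summary does not give the paper's exact formulas, and the function names are hypothetical.

```python
from typing import List


def pass_at_1(case_results: List[List[bool]]) -> float:
    """Share of cases whose every step succeeded on the first attempt."""
    if not case_results:
        return 0.0
    passed = sum(1 for steps in case_results if steps and all(steps))
    return passed / len(case_results)


def complete_at_1(case_results: List[List[bool]]) -> float:
    """Share of individual steps, across all cases, that succeeded."""
    total_steps = sum(len(steps) for steps in case_results)
    if total_steps == 0:
        return 0.0
    completed = sum(sum(steps) for steps in case_results)
    return completed / total_steps


# Each inner list holds the per-step outcomes of one test case.
results = [
    [True, True],
    [True, False],
    [True, True, True],
]
print(f"Pass@1     = {pass_at_1(results):.2%}")
print(f"Complete@1 = {complete_at_1(results):.2%}")
```

Under this reading Complete@1 can never be lower than Pass@1 when all cases have the same number of steps, which matches the ordering of the reported figures.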

Who Should Care

What To Try In 7 Days

Prototype a small multi-agent pipeline with separate prompts for action planning, parameter selection, and state inspection.

Collect a tiny set (50–100) of real UAT steps with view hierarchies and screenshots to test the pipeline end-to-end.

Add a simple reflection loop: log invalid actions and re-prompt the planner to correct them.
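The reflection loop from the last suggestion can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `plan_with_reflection` and `toy_planner` are hypothetical names, and the toy planner stands in for a real LLM call so that the example runs without network access.

```python
from typing import Callable, List, Optional, Set

# A planner takes the step text plus the invalid actions seen so far
# and proposes a skill name. In practice this would wrap an LLM call.
Planner = Callable[[str, List[str]], str]


def plan_with_reflection(
    step: str,
    planner: Planner,
    valid_skills: Set[str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Re-prompt the planner until it proposes a skill that exists."""
    invalid_history: List[str] = []
    for _ in range(max_attempts):
        proposal = planner(step, invalid_history)
        if proposal in valid_skills:
            return proposal
        # Log the invalid action so the next prompt can include it.
        invalid_history.append(proposal)
    return None  # give up; a human tester should look at this step


def toy_planner(step: str, invalid_history: List[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong first, corrected once told."""
    if "press_button" in invalid_history:
        return "click"
    return "press_button"


chosen = plan_with_reflection(
    "Tap the Pay button", toy_planner, valid_skills={"click", "input_text"}
)
print(chosen)
```

Capping the number of attempts keeps a planner that never recovers from looping forever and gives a clear signal for manual review.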

Agent Features

Memory

  • working memory M (short-term conversation summary)
  • invalid-action history for reflection

Planning

  • LLM-based action planning
  • self-reflection correction (prompted)

Tool Use

  • ADB-wrapped skill library
  • OCR/text-detection (SegLink++, ConvNeXt)

Frameworks

  • ReAct-like prompting
  • chain-of-thought style decomposition
  • zero-shot in-context prompting

Is Agentic

true

Architectures

  • multi-agent

Collaboration

  • disordered cooperation (agents interact via shared prompts and memory)

Optimization Features

Token Efficiency

  • filtering view hierarchy to reduce tokens
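
One way to filter a view hierarchy is to keep only nodes that carry a label or are clickable, and only a few attributes of each. The sketch below assumes the XML layout produced by `uiautomator dump` (`node` elements with `text`, `content-desc`, `clickable`, and `bounds` attributes). The filtering rule and the function names are assumptions for illustration; this summary does not say which rule the authors apply.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Attributes worth sending to the LLM; everything else is dropped.
KEEP_ATTRS = ("text", "content-desc", "resource-id", "clickable", "bounds")


def filter_view_hierarchy(xml_dump: str) -> List[Dict[str, str]]:
    """Keep labelled or clickable nodes, with non-empty attributes only."""
    root = ET.fromstring(xml_dump)
    kept: List[Dict[str, str]] = []
    for node in root.iter("node"):
        has_label = bool(node.get("text") or node.get("content-desc"))
        is_clickable = node.get("clickable") == "true"
        if has_label or is_clickable:
            kept.append({k: node.get(k) for k in KEEP_ATTRS if node.get(k)})
    return kept


def bounds_center(bounds: str) -> Tuple[int, int]:
    """Turn a '[x1,y1][x2,y2]' bounds string into a tap point."""
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", bounds))
    return (x1 + x2) // 2, (y1 + y2) // 2


SAMPLE = """
<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc=""
        clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.TextView" text="Pay" content-desc=""
          clickable="true" bounds="[100,200][300,400]" />
    <node class="android.view.View" text="" content-desc=""
          clickable="false" bounds="[0,0][10,10]" />
  </node>
</hierarchy>
"""

nodes = filter_view_hierarchy(SAMPLE)
print(nodes)
print(bounds_center(nodes[0]["bounds"]))
```

In this sample the two unlabelled, non-clickable containers are dropped, so the prompt carries one node instead of three.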

System Optimization

  • skill library of ADB wrappers to limit allowed commands
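
A skill library of this kind can be sketched as a registry of named functions, each returning an `adb shell input ...` command, with any name outside the registry rejected. The skill names and the `build_command` helper are hypothetical; only the underlying `adb shell input tap/text/swipe/keyevent` commands are standard Android tooling. The sketch builds argument lists and does not execute them.

```python
from typing import Callable, Dict, List

SKILLS: Dict[str, Callable[..., List[str]]] = {}


def skill(name: str):
    """Decorator that registers a function as an allowed skill."""
    def register(fn: Callable[..., List[str]]):
        SKILLS[name] = fn
        return fn
    return register


@skill("click")
def click(x: int, y: int) -> List[str]:
    return ["adb", "shell", "input", "tap", str(x), str(y)]


@skill("input_text")
def input_text(text: str) -> List[str]:
    # `adb shell input text` expects spaces encoded as %s.
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]


@skill("swipe")
def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return ["adb", "shell", "input", "swipe", *coords]


@skill("back")
def back() -> List[str]:
    return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]


def build_command(name: str, **kwargs) -> List[str]:
    """Reject skill names the LLM invented; otherwise build the command."""
    if name not in SKILLS:
        raise ValueError(f"unknown skill {name!r}; allowed: {sorted(SKILLS)}")
    return SKILLS[name](**kwargs)


print(build_command("click", x=200, y=300))
```

To run a command on a connected device, the returned list could be passed to `subprocess.run(cmd, check=True)`. The `ValueError` raised for unknown names is a natural trigger for a reflection prompt.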

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Relies on an internal parameter list and many rule-based rewriting parts, limiting portability.
  • Results reported only on WeChat Pay cases; generalization to other apps is untested.
  • The paper does not specify the exact LLM or the prompt token counts used in the experiments.
  • Perception depends on accurate view hierarchy and OCR; failures there break the pipeline.

When Not To Use

  • When you lack structured view hierarchy or reliable screenshots for the app under test.
  • For security- or privacy-sensitive flows where automated inputs could leak secrets.
  • When UAT steps depend on domain knowledge not captured by the rewriting rules or parameters.

Failure Modes

  • The LLM hallucinates invalid or non-existent skill names, leading to execution errors.
  • Wrong parameters are selected from large parameter lists, causing incorrect inputs.
  • OCR/text-detection misses hyperlink bounds, so click targets are wrong.
  • Long test steps exceeding LLM token limits cause forgetting or truncated context.

Core Entities

Models

  • GPT-3.5/GPT-4 (backbone LLMs referenced; exact model for experiments not specified)
  • LLaMA (referenced)

Metrics

  • Pass@1 (case-level pass rate)
  • Complete@1 (step-level completion rate)

Datasets

  • WeChat Pay UAT test cases (450 cases, internal dataset)