Overview
The system is evaluated on 450 real WeChat Pay test cases and has been deployed in WeChat Pay's test environment, so production readiness and cost impact are rated high; novelty is rated moderate because the work combines existing LLM-agent ideas with engineering for GUI testing.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 70%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
Automates the most labor-intensive step in UAT (script generation), cutting manual tester time and making daily regression testing faster and more consistent.
Summary TLDR
The authors build XUAT-Copilot: three LLM-based agents (operation, parameter selection, inspection) plus perception and rewriting modules that turn UAT steps into ADB commands for Android. On 450 real WeChat Pay test cases the multi-agent system reached an 88.55% case pass rate (Pass@1) and a 93.03% step completion rate (Complete@1), against 22.65% and 25.25% for a single-agent variant. The system is deployed in WeChat Pay's test environment and has reduced manual scripting work. Key ideas: rewrite terse test steps, filter the GUI view hierarchy, extract clickable bounds from screenshots, split work across specialized LLM agents, and use self-reflection to fix invalid actions.
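The view-hierarchy filtering idea can be illustrated with a minimal sketch. It assumes the hierarchy arrives as XML in the shape produced by `uiautomator dump` (`node` elements with `clickable`, `bounds`, `text`, and `resource-id` attributes). Keeping only clickable nodes is an assumption made for illustration; the summary does not state the authors' actual filtering rules, and `filter_clickable` and `SAMPLE` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Bounds look like "[x1,y1][x2,y2]" in a uiautomator dump.
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def filter_clickable(xml_text):
    """Keep only clickable nodes and reduce each to a few fields.

    Assumption: "clickable" is the filter criterion; the real system
    may use different or additional rules.
    """
    root = ET.fromstring(xml_text)
    kept = []
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes with missing or malformed bounds
        x1, y1, x2, y2 = map(int, match.groups())
        kept.append({
            "text": node.get("text", ""),
            "resource_id": node.get("resource-id", ""),
            # Centre of the bounding box, usable as a tap target.
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),
        })
    return kept


# Hypothetical two-widget screen used only for this example.
SAMPLE = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.Button" text="Pay" resource-id="com.example:id/pay" clickable="true" bounds="[100,200][300,400]" />
    <node class="android.widget.TextView" text="Balance" resource-id="" clickable="false" bounds="[0,500][1080,600]" />
  </node>
</hierarchy>"""

widgets = filter_clickable(SAMPLE)
```

Reducing each widget to a short record like this keeps the prompt small, which is the usual motivation for filtering before handing GUI state to an LLM.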
Problem Statement
Turning semi-structured UAT steps into executable Android Debug Bridge (ADB) scripts is still manual and error-prone. The task: given a test case (ordered steps), a large parameter list, current GUI state (view hierarchy + screenshot) and a skill library (ADB-wrapping functions), generate a correct sequence of actions that follows each step exactly.
Main Contribution
Design of XUAT-Copilot, an LLM-powered multi-agent pipeline to generate UAT scripts from test-case steps and GUI state.
Practical modules: GUI perception (filtered view hierarchy + image-based hyperlink bounds), a rewriting step to clarify terse instructions, and a skill library of ADB-wrapped commands.
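A skill library of ADB-wrapped commands could look roughly like the sketch below. The skill names (`tap`, `input_text`, `swipe`, `press_back`) and the `build_command` helper are hypothetical and are not taken from the paper; only the underlying `adb shell input ...` subcommands are standard. The functions build argument lists and do not run anything, so the sketch works without a connected device.

```python
def tap(x, y):
    """Tap at screen coordinates (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def input_text(text):
    """Type text into the focused field.

    `input text` interprets "%s" as a space, so spaces are encoded that way.
    """
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]


def swipe(x1, y1, x2, y2, duration_ms=300):
    """Swipe from (x1, y1) to (x2, y2) over duration_ms milliseconds."""
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return ["adb", "shell", "input", "swipe", *coords]


def press_back():
    """Press the Android back key."""
    return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]


# Hypothetical registry: the agent may only name skills listed here.
SKILLS = {
    "tap": tap,
    "input_text": input_text,
    "swipe": swipe,
    "press_back": press_back,
}


def build_command(skill, *args):
    """Look up a skill by name and return the adb argv for it."""
    if skill not in SKILLS:
        raise KeyError(f"unknown skill: {skill}")
    return SKILLS[skill](*args)


command = build_command("tap", 200, 300)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine with `adb` installed and a device attached.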
Key Findings
The multi-agent system raises Pass@1 from 22.65% (single agent) to 88.55% and Complete@1 from 25.25% to 93.03%.
Adding self-reflection in prompts meaningfully increases accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (case-level pass rate) | 88.55% | Single Agent: 22.65% | +65.90pp vs Single Agent | 450 WeChat Pay UAT cases | Table 4 reports method scores | Table 4 |
| Complete@1 (step-level completion rate) | 93.03% | Single Agent: 25.25% | +67.78pp vs Single Agent | 450 WeChat Pay UAT cases | Table 4 reports method scores | Table 4 |
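The two metrics and their deltas can be reproduced with a small sketch. It assumes that a case passes only when every one of its steps completes on the first attempt, and that Complete@1 pools steps across all cases; the summary gives only the labels "case-level pass rate" and "step-level completion rate", so both definitions are assumptions. The toy data is invented for illustration and is not from the paper.

```python
def pass_at_1(cases):
    """Percentage of cases whose steps all completed (assumed definition)."""
    passed = sum(all(steps) for steps in cases)
    return 100 * passed / len(cases)


def complete_at_1(cases):
    """Percentage of steps completed, pooled over all cases (assumed definition)."""
    total = sum(len(steps) for steps in cases)
    done = sum(sum(steps) for steps in cases)
    return 100 * done / total


# Toy data: each inner list holds per-step outcomes for one test case.
toy_cases = [
    [True, True, True],
    [True, False, False],
    [True, True],
]
toy_pass = pass_at_1(toy_cases)          # 2 of 3 cases pass
toy_complete = complete_at_1(toy_cases)  # 6 of 8 steps complete

# Deltas in percentage points, using the values reported in the table.
pass_delta = round(88.55 - 22.65, 2)
complete_delta = round(93.03 - 25.25, 2)
```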
What To Try In 7 Days
Prototype a small multi-agent pipeline: separate planning, parameter selection, and state check prompts.
Collect a tiny set (50–100) of real UAT steps with view hierarchies and screenshots to test the pipeline end-to-end.
Add a simple reflection loop: log invalid actions and re-prompt the planner to correct them.
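The three suggestions above can be combined into one prototype loop, sketched below under stated assumptions. The agents are plain callables (`plan`, `select`, `execute`, `inspect`) standing in for LLM prompts, and all names, signatures, and the feedback format are hypothetical; none of this is the authors' implementation.

```python
def run_step(step, params, plan, select, execute, inspect, max_attempts=3):
    """Run one test step with a reflection loop.

    plan(step, feedback)        -> skill name (operation agent)
    select(step, skill, params) -> argument tuple (parameter-selection agent)
    execute(skill, args)        -> None on success, else an error string
    inspect(step)               -> True when the GUI shows the step is done
    """
    feedback = []  # log of rejected actions, fed back to the planner
    for attempt in range(1, max_attempts + 1):
        skill = plan(step, feedback)
        args = select(step, skill, params)
        error = execute(skill, args)
        if error is None and inspect(step):
            return {"skill": skill, "args": args, "attempts": attempt}
        feedback.append(f"{skill}{args} rejected: {error or 'step not complete'}")
    return None  # give up after max_attempts


# Stub agents so the sketch runs without an LLM or a device.
KNOWN_SKILLS = {"tap", "input_text"}


def plan(step, feedback):
    # First try names a non-existent skill; after feedback it corrects itself.
    return "input_text" if feedback else "type_text"


def select(step, skill, params):
    return (params["amount"],)


def execute(skill, args):
    return None if skill in KNOWN_SKILLS else f"unknown skill {skill!r}"


def inspect(step):
    return True


result = run_step(
    "Enter the payment amount",
    {"amount": "10.00", "payee": "alice"},
    plan, select, execute, inspect,
)
```

Passing the agents in as callables makes it easy to swap a stub for a real prompt one agent at a time while testing on a small set of collected steps.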
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Risks & Boundaries
Limitations
Relies on an internal parameter list and many rule-based rewriting components, which limits portability.
Results reported only on WeChat Pay cases; generalization to other apps is untested.
When Not To Use
When you lack structured view hierarchy or reliable screenshots for the app under test.
For security- or privacy-sensitive flows where automated inputs could leak secrets.
Failure Modes
The LLM may hallucinate invalid or non-existent skill names, leading to execution errors.
Wrong parameter selection from a large parameter list causes incorrect inputs.
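One possible mitigation for the first failure mode is to validate a proposed skill name against the library before executing it, and to return a hint that can be fed back to the model. The sketch below uses the standard-library `difflib.get_close_matches`; the skill names and the `check_skill` helper are hypothetical, and the summary does not say whether the authors use such a guard.

```python
import difflib

# Hypothetical skill names; a real library would define its own.
KNOWN_SKILLS = ["tap", "input_text", "swipe", "press_back"]


def check_skill(name, known=KNOWN_SKILLS):
    """Return (ok, hint).

    ok is True when the name exists. Otherwise hint is the closest
    known skill name, or None when nothing is similar enough.
    """
    if name in known:
        return True, None
    close = difflib.get_close_matches(name, known, n=1)
    return False, (close[0] if close else None)


ok, hint = check_skill("swype")
```

A guard like this turns a runtime execution error into a cheap, explicit message ("unknown skill, did you mean ...") that a reflection prompt can act on.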

