Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
Automates the most labor-intensive step in UAT (script generation), cutting manual tester time and making daily regression testing faster and more consistent.
Summary TLDR
The authors build XUAT-Copilot: three LLM-based agents (operation, parameter selection, inspection) plus perception and rewriting modules to turn UAT steps into ADB commands for Android. On 450 real WeChat Pay test cases the multi-agent system reached 88.55% case pass rate (Pass@1) and 93.03% step completion (Complete@1), far above a single-agent variant. The system is deployed in WeChat Pay's test environment and reduced manual scripting work. Key ideas: rewrite terse test steps, filter GUI view hierarchy, extract clickable bounds from screenshots, split work across specialized LLM agents, and use self-reflection to fix invalid actions.
Problem Statement
Turning semi-structured UAT steps into executable Android Debug Bridge (ADB) scripts is still manual and error-prone. The task: given a test case (ordered steps), a large parameter list, current GUI state (view hierarchy + screenshot) and a skill library (ADB-wrapping functions), generate a correct sequence of actions that follows each step exactly.
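The inputs and output of this task can be pictured as a few plain data types. The sketch below is illustrative only: the class and field names (`UATCase`, `GUIState`, `Action`, `is_known_skill`) are assumptions made for this summary, not names from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class GUIState:
    """Snapshot of the current screen (hypothetical container)."""
    view_hierarchy_xml: str   # raw view hierarchy dump
    screenshot_png: bytes     # screenshot of the same screen


@dataclass
class UATCase:
    """One test case: ordered steps plus the candidate parameter list."""
    steps: list[str]
    parameters: dict[str, str]


@dataclass
class Action:
    """One generated action: a skill name plus its arguments."""
    skill: str
    args: dict[str, str] = field(default_factory=dict)


def is_known_skill(action: Action, skill_names: set[str]) -> bool:
    """An action is only executable if its skill exists in the library."""
    return action.skill in skill_names


case = UATCase(steps=["Open the wallet page"], parameters={"amount": "0.01"})
print(is_known_skill(Action("tap"), {"tap", "input_text"}))  # True
```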
Main Contribution
Design of XUAT-Copilot, an LLM-powered multi-agent pipeline to generate UAT scripts from test-case steps and GUI state.
Practical modules: GUI perception (filtered view hierarchy + image-based hyperlink bounds), a rewriting step to clarify terse instructions, and a skill library of ADB-wrapped commands.
Empirical demonstration on 450 real WeChat Pay UAT cases, showing large gains over a single-agent baseline, plus deployment in WeChat Pay's test environment.
Key Findings
The multi-agent system greatly improves pass rates over a single-agent LLM variant, reaching 88.55% Pass@1 and 93.03% Complete@1.
Adding self-reflection in prompts meaningfully increases accuracy.
The method was tested on 450 real WeChat Pay test cases and deployed in WeChat Pay's test environment.
Results
Pass@1 (case-level pass rate): 88.55%
Complete@1 (step-level completion rate): 93.03%
Effect of reflection ablation
Who Should Care
What To Try In 7 Days
Prototype a small multi-agent pipeline with separate prompts for planning, parameter selection, and state checking.
Collect a tiny set (50–100) of real UAT steps with view hierarchies and screenshots to test the pipeline end-to-end.
Add a simple reflection loop: log invalid actions and re-prompt the planner to correct them.
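The three suggestions above can be combined into one small loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: the agents are passed in as plain callables (in practice each would wrap an LLM prompt), and the `<PARAM>` placeholder, function names, and retry limit are all hypothetical.

```python
from typing import Callable, Optional


def run_step(
    step: str,
    plan: Callable[[str, list[str]], str],        # planning agent
    select_param: Callable[[str], str],           # parameter selection agent
    inspect: Callable[[str, list[str]], bool],    # state-check agent
    execute: Callable[[str], tuple[bool, str]],   # runs an action, returns (ok, error)
    max_attempts: int = 5,
) -> Optional[list[str]]:
    """Return the executed actions once the step is judged complete, else None."""
    done: list[str] = []      # actions that executed successfully
    invalid: list[str] = []   # invalid-action history fed back for reflection
    for _ in range(max_attempts):
        action = plan(step, invalid)
        if "<PARAM>" in action:
            action = action.replace("<PARAM>", select_param(step))
        ok, error = execute(action)
        if not ok:
            # Reflection: remember the failure so the planner can avoid it.
            invalid.append(f"{action} -> {error}")
            continue
        done.append(action)
        if inspect(step, done):
            return done
    return None


# Stub agents standing in for LLM calls.
def fake_plan(step: str, invalid: list[str]) -> str:
    return "tap_text('Pay')" if invalid else "press_button('Pay')"


def fake_execute(action: str) -> tuple[bool, str]:
    if action.startswith("tap_text("):
        return True, ""
    return False, "unknown skill"


result = run_step("Tap Pay", fake_plan, lambda s: "", lambda s, d: True, fake_execute)
print(result)  # ["tap_text('Pay')"]
```

Here the first planned action names a skill that does not exist, the failure is logged, and the second attempt succeeds because the planner sees the invalid-action history.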
Agent Features
Memory
- working memory M (short-term conversation summary)
- invalid-action history for reflection
Planning
- LLM-based action planning
- self-reflection correction (prompted)
Tool Use
- ADB-wrapped skill library
- OCR/text-detection (SegLink++, ConvNeXt)
Frameworks
- ReAct-like prompting
- chain-of-thought style decomposition
- zero-shot in-context prompting
Is Agentic
true
Architectures
- multi-agent
Collaboration
- disordered cooperation (agents interact via shared prompts and memory)
Optimization Features
Token Efficiency
- filtering view hierarchy to reduce tokens
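One way to picture this filtering is the sketch below, which parses a view hierarchy dump in the XML shape produced by Android's `uiautomator dump` and keeps only labeled or clickable nodes with a handful of attributes. The keep rule and the attribute list are assumptions chosen for illustration; the paper's actual filtering rules are not given in this summary.

```python
import xml.etree.ElementTree as ET

# Attributes worth sending to the LLM (an assumed, illustrative selection).
KEEP_ATTRS = ("text", "content-desc", "resource-id", "class", "bounds")


def filter_hierarchy(xml_dump: str) -> list[dict[str, str]]:
    """Keep nodes that carry a label or are clickable; drop everything else."""
    root = ET.fromstring(xml_dump)
    kept = []
    for node in root.iter("node"):
        has_label = node.get("text") or node.get("content-desc")
        interactive = node.get("clickable") == "true"
        if has_label or interactive:
            # Copy only non-empty attributes from the keep list.
            kept.append({k: node.get(k) for k in KEEP_ATTRS if node.get(k)})
    return kept


SAMPLE = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc="" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.Button" text="Pay" content-desc="" clickable="true" bounds="[40,1700][1040,1820]"/>
  </node>
</hierarchy>"""

for item in filter_hierarchy(SAMPLE):
    print(item)
```

On the sample, the unlabeled, non-clickable container is dropped and only the "Pay" button survives, with its empty attributes omitted.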
System Optimization
- skill library of ADB wrappers to limit allowed commands
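A skill library of this kind can be as small as a dictionary of functions that build `adb shell input` commands, with anything outside the dictionary rejected before execution. The sketch below is an assumption-laden illustration: the skill names (`tap`, `input_text`, `press_back`) and the helpers `build_command` and `run` are hypothetical, not the paper's library.

```python
import subprocess


def tap(x: int, y: int) -> list[str]:
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def input_text(text: str) -> list[str]:
    # `adb shell input text` expects spaces to be written as %s.
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]


def press_back() -> list[str]:
    return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]


# The whitelist: only these names may be produced by the agents.
SKILLS = {"tap": tap, "input_text": input_text, "press_back": press_back}


def build_command(skill: str, *args) -> list[str]:
    """Translate a skill call into an adb command, rejecting unknown skills."""
    if skill not in SKILLS:
        raise ValueError(f"unknown skill: {skill}")
    return SKILLS[skill](*args)


def run(skill: str, *args) -> None:
    """Execute the skill on a connected device (requires adb on PATH)."""
    subprocess.run(build_command(skill, *args), check=True)


print(build_command("tap", 540, 1760))
```

Rejecting unknown names up front turns a hallucinated skill into an explicit error that can be fed back to the planner instead of an opaque device failure.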
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Relies on an internal parameter list and many rule-based rewriting parts, limiting portability.
- Results reported only on WeChat Pay cases; generalization to other apps is untested.
- The paper does not specify the exact LLM or the prompt token counts used in experiments.
- Perception depends on accurate view hierarchy and OCR; failures there break the pipeline.
When Not To Use
- When you lack structured view hierarchy or reliable screenshots for the app under test.
- For security- or privacy-sensitive flows where automated inputs could leak secrets.
- When UAT requirements depend on domain knowledge not captured by the rewriting rules or parameters.
Failure Modes
- The LLM hallucinates invalid or non-existent skill names, leading to execution errors.
- Wrong parameter selection from large parameter lists causes incorrect inputs.
- OCR/text detection misses hyperlink bounds, so click targets are wrong.
- Long test steps that exceed LLM token limits cause forgetting or truncated context.
Core Entities
Models
- GPT-3.5/GPT-4 (backbone LLMs referenced; exact model for experiments not specified)
- LLaMA (referenced)
Metrics
- Pass@1 (case-level pass rate)
- Complete@1 (step-level completion rate)
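The two metrics can be computed from per-step outcomes of a single attempt per case. The sketch below assumes that Pass@1 counts a case as passed only when every step succeeds, and that Complete@1 is the share of completed steps pooled across all cases; the paper's exact definitions may differ, so treat both functions as illustrative.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Fraction of cases in which every step succeeded on the first attempt."""
    return sum(all(steps) for steps in results) / len(results)


def complete_at_1(results: list[list[bool]]) -> float:
    """Fraction of all steps, pooled across cases, that were completed."""
    total_steps = sum(len(steps) for steps in results)
    return sum(sum(steps) for steps in results) / total_steps


# Three toy cases; each inner list holds one boolean per step.
outcomes = [
    [True, True, True],
    [True, False, False],
    [True, True],
]
print(round(pass_at_1(outcomes), 4))      # 0.6667
print(round(complete_at_1(outcomes), 4))  # 0.75
```

Under these definitions Complete@1 is always at least as high as Pass@1, which is consistent with the reported 93.03% versus 88.55%.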
Datasets
- WeChat Pay UAT test cases (450 cases, internal dataset)

