Overview
The system shows strong practical gains on a large industrial dataset and in deployment, but results are from one app and rely on a closed LLM in experiments.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/4
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 50%
Why It Matters For Business
You can automate most feature-level UI tests cheaply by combining a single retrieved example with ML-based element matching and calling LLMs only for ambiguous cases, cutting LLM spend while keeping test quality.
Summary TLDR
CAT is a hybrid system that combines retrieval-augmented LLM prompting (one-shot RAG) with lightweight machine learning to generate and execute UI automation tests for a large industrial app (WeChat). On a 39k-task dataset, CAT reaches 90% automated completion at $0.34 per task. It matches AdbGPT's success rate while cutting LLM cost by ~68% ($0.34 vs. $1.07 per task). CAT is integrated into WeChat, where it ran 6.3k tests in six months and found 141 bugs.
Problem Statement
Generating robust UI automation tests for industrial apps is costly and brittle. Pure ML solutions fail on element mismatch, while pure LLM solutions are too expensive at scale. The paper targets the cost and domain-knowledge gaps that arise when applying LLMs to industry-level UI testing.
Main Contribution
CAT: a two-phase system that uses retrieval-augmented LLM prompts to decompose task descriptions, then ML plus LLM optimization to map actions to UI elements.
A practical RAG design: top-1 example retrieval (one-shot) using T5 embeddings and cosine similarity, so that the single retrieved example serves as the in-context demonstration.
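The top-1 retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes embeddings have already been computed (for example by a T5 encoder) and are available as plain lists of floats. The function names `cosine_similarity`, `retrieve_top1`, and `build_one_shot_prompt`, the prompt wording, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top1(query_vec, indexed_examples):
    """Return (example_text, score) for the most similar indexed example.

    indexed_examples is a non-empty list of (example_text, embedding) pairs;
    the embeddings are assumed to be precomputed.
    """
    if not indexed_examples:
        raise ValueError("indexed_examples must not be empty")
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_examples]
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text, best_score


def build_one_shot_prompt(task_description, example_text):
    """Assemble a one-shot prompt; the wording is an assumption, not the paper's prompt."""
    return (
        "Decompose the test task into UI actions.\n\n"
        f"Example:\n{example_text}\n\n"
        f"Task:\n{task_description}\n"
    )


# Toy index with made-up 2-dimensional embeddings, for illustration only.
index = [
    ("Task: send a message -> tap Chats; tap contact; type text; tap Send", [0.9, 0.1]),
    ("Task: change avatar -> tap Me; tap profile photo; choose image", [0.0, 1.0]),
]
example, score = retrieve_top1([1.0, 0.0], index)
prompt = build_one_shot_prompt("Send a voice message to a contact", example)
```

Retrieving a single example keeps the prompt short, which is consistent with the failure mode noted later about instruction forgetting when many examples are included.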
Key Findings
CAT automates 90% of test tasks.
Average LLM cost per test with CAT is $0.34.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Completion rate | 90% | CAT (no optimizer) 52% | +38pp vs CAT (no optimizer) | WeChat test split (2,010 tasks) | Table 3 reports 90% for CAT | Table 3 |
| Average LLM cost | $0.34 | AdbGPT $1.07 | -$0.73 vs AdbGPT | WeChat full evaluation (39k tasks) | Table 4 shows CAT $0.34 vs AdbGPT $1.07 | Table 4 |
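The deltas in the table, and the ~68% cost reduction quoted in the summary, follow directly from the reported figures. A short check, using only the numbers above (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Figures taken from the results table above.
cat_cost, adbgpt_cost = 0.34, 1.07          # USD per task (Table 4)
cat_rate, no_optimizer_rate = 90, 52        # completion rate in % (Table 3)

cost_saving = adbgpt_cost - cat_cost                        # absolute saving per task
cost_saving_pct = relative_reduction(adbgpt_cost, cat_cost)  # relative saving
rate_gain_pp = cat_rate - no_optimizer_rate                  # percentage points

print(f"Cost saving per task: ${cost_saving:.2f} ({cost_saving_pct:.0%})")
print(f"Completion gain: +{rate_gain_pp}pp")
```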
What To Try In 7 Days
Index past test scripts and build T5 embeddings for retrieval.
Prompt an off-the-shelf LLM with one retrieved example to generate action sequences.
Use a cheap ML matcher for element mapping and call LLM only when similarity is below a threshold.
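The third step above, a cheap matcher with an LLM fallback below a similarity threshold, might look like the sketch below. Several things here are assumptions for illustration: token-overlap (Jaccard) similarity stands in for whatever ML matcher you train, the `threshold` value of 0.5 is arbitrary, and `llm_fallback` is a hypothetical callable you would implement against your own LLM client.

```python
def token_jaccard(a, b):
    """Jaccard overlap of lowercase whitespace tokens; a stand-in for a learned matcher."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_element(action_target, ui_elements, llm_fallback, threshold=0.5):
    """Map an action target to a UI element label.

    Returns (element, source), where source is "ml" if the cheap matcher was
    confident enough and "llm" if the fallback was called.
    """
    if not ui_elements:
        raise ValueError("ui_elements must not be empty")
    scored = [(token_jaccard(action_target, elem), elem) for elem in ui_elements]
    best_score, best_elem = max(scored, key=lambda pair: pair[0])
    if best_score >= threshold:
        return best_elem, "ml"
    # Ambiguous case: defer to the (more expensive) LLM.
    return llm_fallback(action_target, ui_elements), "llm"


def fake_llm(action_target, ui_elements):
    """Placeholder for a real LLM call; always picks the first element."""
    return ui_elements[0]


confident = match_element("Send button", ["Send button", "Cancel"], fake_llm)
ambiguous = match_element("pay now", ["Checkout", "Wallet"], fake_llm)
```

Logging the `source` field gives a direct measure of how often the LLM is actually invoked, which is the quantity that drives per-task cost.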
Risks & Boundaries
Limitations
Evaluation is limited to WeChat; generality to other apps is claimed but not fully measured.
Relies on ChatGPT (gpt-4) in experiments; open models performed worse in tests.
When Not To Use
When you cannot send UI data to an external LLM service for privacy or licensing reasons.
When your app's UI relies on heavy visual/layout context that text view hierarchies cannot capture.
Failure Modes
LLM hallucination leading to invalid action sequences.
Instruction forgetting and format errors when many examples are included in the prompt.

