Use one-shot retrieval + light ML to run cheap, reliable UI tests at scale

September 12, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The system shows strong practical gains on a large industrial dataset and in deployment, but results are from one app and rely on a closed LLM in experiments.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 50%

Authors

Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti

Why It Matters For Business

You can automate most feature-level UI tests cheaply by combining a single retrieved example with ML-based element matching and calling LLMs only for ambiguous cases, cutting LLM spend while keeping test quality.

Summary TLDR

CAT is a hybrid system that combines retrieval-augmented LLM prompting (one-shot RAG) with lightweight machine learning to generate and execute UI automation tests for a large industrial app (WeChat). On a dataset of about 39k tasks, CAT reaches 90% automated completion at $0.34 per task, matching AdbGPT's success rate while cutting LLM cost by ~68% ($0.34 vs. $1.07). CAT is integrated into WeChat and ran 6.3k tests in six months, finding 141 bugs.

Problem Statement

Generating robust UI automation tests for industrial apps is costly and brittle. Pure ML or pure LLM solutions either fail on element mismatch or are too expensive at scale. The paper targets cost and knowledge gaps when applying LLMs to industry-level UI testing.

Main Contribution

CAT: a two-phase system that uses retrieval-augmented LLM prompts to decompose task descriptions, then ML plus LLM optimization to map actions to UI elements.

A practical RAG design: top-1 example retrieval using T5 embeddings and cosine similarity, so each prompt carries a single (one-shot) example as context.
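The retrieval step described above can be illustrated with a minimal sketch. It assumes a toy bag-of-words embedding in place of the T5 encoder, and the corpus entries, action names, and prompt wording are all hypothetical; none of this is taken from the paper's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for T5 embeddings."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top1(query, corpus):
    """Return (similarity, example) for the single most similar past task."""
    q = embed(query)
    scored = ((cosine(q, embed(ex["task"])), ex) for ex in corpus)
    return max(scored, key=lambda pair: pair[0])


def build_one_shot_prompt(query, example):
    """Place exactly one retrieved example ahead of the new task."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(example["actions"], 1))
    return (
        "Decompose the task into UI actions.\n\n"
        f"Example task: {example['task']}\n"
        f"Example actions:\n{steps}\n\n"
        f"Task: {query}\n"
        "Actions:"
    )


# Hypothetical indexed test scripts (illustrative only).
corpus = [
    {"task": "send a text message to a contact",
     "actions": ["tap Chats", "tap contact", "type message", "tap Send"]},
    {"task": "change the profile photo",
     "actions": ["tap Me", "tap profile", "tap photo", "choose image"]},
]

query = "send a voice message to a contact"
score, best = retrieve_top1(query, corpus)
prompt = build_one_shot_prompt(query, best)
print(prompt)
```

Swapping `embed` for a real sentence encoder would leave `retrieve_top1` and the prompt builder unchanged, which is the main reason to keep the embedding behind its own function.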

Key Findings

CAT automates 90% of test tasks.

Numbers: 90% completion (Table 3, CAT)

Practical Use: Expect most high-level testing tasks to run without human steps when using CAT on similar apps.

Evidence Ref: Table 3 and Section 3.1

Average LLM cost per test with CAT is $0.34.

Numbers: $0.34 per test (Table 3)

Practical Use: Budget roughly $0.34 per automated test with the same setup; cost remains low enough for large test batches.

Evidence Ref: Table 3 and Section 3.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Completion rate | 90% | CAT (no optimizer) 52% | +38pp vs CAT (no optimizer) | WeChat test split (2,010 tasks) | Table 3 reports 90% for CAT | Table 3 |
| Average LLM cost | $0.34 | AdbGPT $1.07 | -$0.73 vs AdbGPT | WeChat full evaluation (39k tasks) | Table 4 shows CAT $0.34 vs AdbGPT $1.07 | Table 4 |
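The deltas in these results follow directly from the reported values. The short check below uses only the numbers stated in this summary; the variable names are illustrative.

```python
# Values as reported in this summary.
cat_completion = 90        # percent, CAT
no_opt_completion = 52     # percent, CAT without the optimizer
cat_cost = 0.34            # USD per test, CAT
adbgpt_cost = 1.07         # USD per test, AdbGPT

completion_delta_pp = cat_completion - no_opt_completion
cost_delta = cat_cost - adbgpt_cost
cost_reduction_pct = (adbgpt_cost - cat_cost) / adbgpt_cost * 100

print(f"{completion_delta_pp:+d}pp")         # +38pp
print(f"{cost_delta:+.2f} USD")              # -0.73 USD
print(f"{cost_reduction_pct:.0f}% lower")    # 68% lower
```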

What To Try In 7 Days

Index past test scripts and build T5 embeddings for retrieval.

Prompt an off-the-shelf LLM with one retrieved example to generate action sequences.

Use a cheap ML matcher for element mapping and call LLM only when similarity is below a threshold.
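The third step, a cheap matcher with an LLM fallback, might look like the sketch below. It assumes string similarity from Python's `difflib` as a stand-in for the ML matcher, a hypothetical threshold of 0.8, and a stub function in place of a real LLM call; the paper's actual matcher and threshold are not described in this summary.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed stand-in for the ML matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_element(action_target, elements, threshold=0.8, llm_fallback=None):
    """Map an action target to a UI element label.

    Returns (element, source), where source records which path resolved it.
    The LLM is consulted only when the best cheap match is below the threshold.
    """
    best = max(elements, key=lambda e: similarity(action_target, e))
    if similarity(action_target, best) >= threshold:
        return best, "matcher"
    if llm_fallback is not None:
        return llm_fallback(action_target, elements), "llm"
    return None, "unresolved"


# Hypothetical screen contents and a stub standing in for the LLM call.
elements = ["Send", "Cancel", "Add to Favorites"]
llm_calls = []


def fake_llm(target, candidates):
    llm_calls.append(target)
    return candidates[2]


easy = match_element("send", elements, llm_fallback=fake_llm)
hard = match_element("bookmark this message", elements, llm_fallback=fake_llm)
print(easy, hard, len(llm_calls))
```

Returning the `source` alongside the element makes it straightforward to log how often the fallback fires, which is the quantity that drives LLM spend.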

Agent Features

Memory: retrieval memory (indexed past usages)
Planning: task decomposition
Tool Use: RAG retrieval; LLM as optimizer
Frameworks: hybrid ML + LLM

Optimization Features

Token Efficiency: one-shot context reduces tokens vs N-shot
Infra Optimization: option to replace ChatGPT with open models (LLaMA70B tested)
System Optimization: hybrid ML primary + LLM secondary to lower cost
Inference Optimization: call LLMs only for low-similarity cases; use top-1 retrieval to shrink prompt size
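To see why a single retrieved example shrinks the prompt, the sketch below builds prompts with one and with three examples and compares a rough size measure. Whitespace word count is only a crude proxy for LLM tokens, and the examples and prompt wording are hypothetical.

```python
def build_prompt(task, examples):
    """Assemble an instruction, zero or more examples, and the new task."""
    parts = ["Decompose the task into UI actions."]
    for ex in examples:
        actions = "; ".join(ex["actions"])
        parts.append(f"Example task: {ex['task']}\nExample actions: {actions}")
    parts.append(f"Task: {task}\nActions:")
    return "\n\n".join(parts)


def rough_token_count(text):
    """Whitespace word count; an assumed, crude proxy for tokenizer output."""
    return len(text.split())


# Hypothetical retrieved examples, ordered by similarity.
retrieved = [
    {"task": "send a text message", "actions": ["tap Chats", "type message", "tap Send"]},
    {"task": "send a photo", "actions": ["tap Chats", "tap Album", "tap Send"]},
    {"task": "forward a message", "actions": ["long press message", "tap Forward"]},
]

task = "send a voice message"
one_shot = rough_token_count(build_prompt(task, retrieved[:1]))
three_shot = rough_token_count(build_prompt(task, retrieved[:3]))
print(one_shot, three_shot)
```

Prompt size grows roughly linearly with the number of examples, so top-1 retrieval keeps the per-call cost close to the instruction plus one example.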

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Evaluation is limited to WeChat; generality to other apps is claimed but not fully measured.

Relies on ChatGPT (gpt-4) in experiments; open models performed worse in tests.

When Not To Use

When you cannot send UI data to an external LLM service for privacy or licensing reasons.

When your app's UI relies on heavy visual/layout context that text view hierarchies cannot capture.

Failure Modes

LLM hallucination leading to invalid action sequences.

Instruction forgetting and format errors when many examples are included in the prompt.

Core Entities

Models

ChatGPT (gpt-4), LLaMA70B, T5 encoder

Metrics

completion rate, average LLM cost, average time per test

Datasets

WeChat testing dataset (39,981 task descriptions), Retrieval dataset (37,971 examples), Testing split (2,010 examples)