Use one-shot retrieval + light ML to run cheap, reliable UI tests at scale

Overview

Decision Snapshot: Ready For Pilot

The system shows strong practical gains on a large industrial dataset and in deployment, but results are from one app and rely on a closed LLM in experiments.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 50%

Authors

Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti


Why It Matters For Business

You can automate most feature-level UI tests cheaply by combining a single retrieved example with ML-based element matching and calling LLMs only for ambiguous cases, cutting LLM spend while keeping test quality.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

CAT is a hybrid system that combines retrieval-augmented LLM prompting (one-shot RAG) with lightweight machine learning to generate and execute UI automation tests for a large industrial app (WeChat). On a dataset of about 39k tasks, CAT reaches 90% automated completion at an average LLM cost of $0.34 per task. It matches AdbGPT's success rate while cutting LLM cost by about 68% ($0.34 vs. $1.07). CAT is integrated into WeChat, where it ran 6.3k tests in six months and found 141 bugs.

Problem Statement

Generating robust UI automation tests for industrial apps is costly and brittle. Pure ML or pure LLM solutions either fail on element mismatch or are too expensive at scale. The paper targets cost and knowledge gaps when applying LLMs to industry-level UI testing.

Main Contribution

CAT: a two-phase system that uses retrieval-augmented LLM prompts to decompose task descriptions, then ML plus LLM optimization to map actions to UI elements.

A practical RAG design: top-1 example retrieval (one-shot) using T5 embeddings and cosine similarity, so the prompt carries a single in-context example.
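The top-1 retrieval step can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes embeddings have already been computed (for example by a T5 encoder), and the toy vectors, the task strings, and the names `cosine_similarity` and `retrieve_top1` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_top1(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return (example_text, score) for the single most similar indexed example."""
    if not index:
        raise ValueError("index is empty")
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical index of past task descriptions with made-up 3-d embeddings.
INDEX = [
    ("Send a text message to a contact", [0.9, 0.1, 0.0]),
    ("Change the profile photo", [0.1, 0.9, 0.2]),
    ("Post a photo to the feed", [0.0, 0.3, 0.9]),
]

# A made-up query embedding that lies closest to the first example.
example, score = retrieve_top1([0.8, 0.2, 0.1], INDEX)
print(example, round(score, 3))
```

Returning only the best match keeps the prompt to one example, which is the token-saving choice the summary attributes to CAT.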

Key Findings

CAT automates 90% of test tasks.

Numbers: 90% completion (Table 3, CAT)

Practical Use: Expect most high-level testing tasks to run without human steps when using CAT on similar apps.

Evidence Ref: Table 3 and Section 3.1

Average LLM cost per test with CAT is $0.34.

Numbers: $0.34 per test (Table 3)

Practical Use: Budget roughly $0.34 per automated test with the same setup; cost remains low enough for large test batches.

Evidence Ref: Table 3 and Section 3.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Completion rate	90%	CAT (no optimizer) 52%	+38pp vs CAT (no optimizer)	WeChat test split (2,010 tasks)	Table 3 reports 90% for CAT	Table 3
Average LLM cost	$0.34	AdbGPT $1.07	-$0.73 vs AdbGPT	WeChat full evaluation (39k tasks)	Table 4 shows CAT $0.34 vs AdbGPT $1.07	Table 4
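The cost figures in the table can be checked with a few lines of arithmetic. The sketch below uses only the two per-test costs reported above; the batch size of 1,000 tests and the function names are illustrative assumptions.

```python
def relative_saving(cost: float, baseline: float) -> float:
    """Fraction of the baseline cost saved, e.g. 0.25 means 25% cheaper."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - cost / baseline


def batch_budget(num_tests: int, cost_per_test: float) -> float:
    """Estimated LLM spend for a batch, assuming the average cost holds."""
    return num_tests * cost_per_test


CAT_COST = 0.34     # average LLM cost per test reported for CAT
ADBGPT_COST = 1.07  # average LLM cost per test reported for AdbGPT

saving = relative_saving(CAT_COST, ADBGPT_COST)
print(f"Relative saving: {saving:.1%}")  # Relative saving: 68.2%

# Hypothetical batch of 1,000 tests at the reported average cost.
print(f"Budget for 1,000 tests: ${batch_budget(1000, CAT_COST):.2f}")
```

The 68.2% result agrees with the roughly 68% saving quoted in the summary. The budget estimate only holds if a new app's tasks resemble the WeChat workload.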

What To Try In 7 Days

Index past test scripts and build T5 embeddings for retrieval.

Prompt an off-the-shelf LLM with one retrieved example to generate action sequences.

Use a cheap ML matcher for element mapping and call LLM only when similarity is below a threshold.
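The last step, a cheap matcher first and an LLM only below a similarity threshold, might look like the sketch below. It is not the paper's matcher: `difflib.SequenceMatcher` stands in for the ML model, and the 0.8 threshold, the `stub_llm` placeholder, and the element labels are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple


def text_similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; a stand-in for a learned matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_action_to_element(
    action_target: str,
    elements: Sequence[str],
    llm_fallback: Callable[[str, Sequence[str]], str],
    threshold: float = 0.8,  # assumed value; tune on your own data
) -> Tuple[str, str]:
    """Return (element, source), where source is "matcher" or "llm"."""
    if not elements:
        raise ValueError("no UI elements to match against")
    best = max(elements, key=lambda e: text_similarity(action_target, e))
    if text_similarity(action_target, best) >= threshold:
        return best, "matcher"  # confident match, no LLM call needed
    # Ambiguous case: defer to the (more expensive) LLM.
    return llm_fallback(action_target, elements), "llm"


LLM_CALLS: List[str] = []


def stub_llm(action_target: str, elements: Sequence[str]) -> str:
    """Placeholder for a real LLM call; records the request and picks the first element."""
    LLM_CALLS.append(action_target)
    return elements[0]


ELEMENTS = ["Send", "Cancel", "Add photo"]

print(map_action_to_element("send", ELEMENTS, stub_llm))
print(map_action_to_element("submit the message", ELEMENTS, stub_llm))
print("LLM calls:", len(LLM_CALLS))
```

Counting fallback calls, as `LLM_CALLS` does here, is one simple way to see how much of the LLM spend the threshold avoids.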

Agent Features

Memory

retrieval memory (indexed past usages)

Planning

task decomposition

Tool Use

RAG retrieval

LLM as optimizer

Frameworks

hybrid ML + LLM

Optimization Features

Token Efficiency

one-shot context reduces tokens vs N-shot

Infra Optimization

option to replace ChatGPT with open models (LLaMA70B tested)

System Optimization

hybrid ML primary + LLM secondary to lower cost

Inference Optimization

call LLMs only for low-similarity cases

use top-1 retrieval to shrink prompt size

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Evaluation is limited to WeChat; generality to other apps is claimed but not fully measured.

Relies on ChatGPT (gpt-4) in experiments; open models performed worse in tests.

When Not To Use

When you cannot send UI data to an external LLM service for privacy or licensing reasons.

When your app's UI relies on heavy visual/layout context that text view hierarchies cannot capture.

Failure Modes

LLM hallucination leading to invalid action sequences.

Instruction forgetting and format errors when many examples are included in the prompt.

Core Entities

Models

ChatGPT (gpt-4)

LLaMA70B

T5 encoder

Metrics

completion rate

average LLM cost

average time per test

Datasets

WeChat testing dataset (39,981 task descriptions)

Retrieval dataset (37,971 examples)

Testing split (2,010 examples)

