Use one-shot retrieval + light ML to run cheap, reliable UI tests at scale

September 12, 2024 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti

Links

Abstract / PDF

Why It Matters For Business

You can automate most feature-level UI tests cheaply by combining a single retrieved example with ML-based element matching and calling LLMs only for ambiguous cases, cutting LLM spend while keeping test quality.

Summary TLDR

CAT is a hybrid system that combines retrieval-augmented LLM prompting (one-shot RAG) with lightweight machine learning to generate and execute UI automation tests for a large industrial app (WeChat). On a dataset of about 39k tasks, CAT reaches 90% automated completion at $0.34 per task. It matches AdbGPT's success rate while cutting LLM cost by ~68%. CAT is integrated into WeChat's testing workflow, where it ran 6.3k tests in six months and found 141 bugs.

Problem Statement

Generating robust UI automation tests for industrial apps is costly and brittle. Pure-ML approaches fail on element mismatch, while pure-LLM approaches are too expensive at scale. The paper targets the cost and knowledge gaps that arise when applying LLMs to industry-level UI testing.

Main Contribution

CAT: a two-phase system that uses retrieval-augmented LLM prompts to decompose task descriptions, then ML plus LLM optimization to map actions to UI elements.

A practical RAG design: top-1 example retrieval (one-shot) using T5 embeddings + cosine similarity to supply a single in-context example.

Large-scale evaluation on 39k WeChat tasks with ablations and comparisons to prior methods, plus a real-world integration showing bug detection.

Empirical evidence that hybrid ML+LLM yields similar completion rates to LLM-first methods while reducing LLM cost significantly.
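The top-1 retrieval step in the RAG design above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function is a toy bag-of-words stand-in for a T5 encoder, and the example tasks and action names are hypothetical. Only the selection rule (highest cosine similarity wins, one example returned) follows the description.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a T5 encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top1(query: str, examples: list[dict]) -> dict:
    """Return the single stored example most similar to the query task."""
    query_vec = embed(query)
    return max(examples, key=lambda ex: cosine(query_vec, embed(ex["task"])))


# Hypothetical past test cases: a task description plus its action sequence.
EXAMPLES = [
    {"task": "send a text message to a friend",
     "actions": ["tap Chats", "tap contact", "type message", "tap Send"]},
    {"task": "post a photo to moments",
     "actions": ["tap Discover", "tap Moments", "select photo", "tap Post"]},
    {"task": "add a new contact by phone number",
     "actions": ["tap Contacts", "tap Add", "type number", "tap Search"]},
]

best = retrieve_top1("send a voice message to a friend", EXAMPLES)
print(best["task"])
```

In a real setup the embeddings of the indexed examples would be computed once and cached, so each new task costs one encoder call plus a similarity scan.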

Key Findings

CAT automates 90% of test tasks.

Numbers: 90% completion (Table 3, CAT)

Average LLM cost per test with CAT is $0.34.

Numbers: $0.34 per test (Table 3)

CAT matches AdbGPT's 90% success rate at roughly a third of the LLM spend.

Numbers: AdbGPT $1.07 vs CAT $0.34 per test; the paper states a total saving of $1,467

One-shot retrieval raises completion by 40 percentage points vs zero-shot.

Numbers: 0-shot 50% -> 1-shot 90% (40 percentage points) (Table 3)

The LLM-based optimizer for element mapping raises completion by 38 percentage points vs no optimizer.

Numbers: CAT vs CAT (no optimizer): 90% vs 52% (38 pp) (Table 3)

Real-world deployment ran 6.3k tests and found 141 bugs.

Numbers: 6,300 runs -> 141 bugs (Section 3.3)

Results

Completion rate

Value: 90%

Baseline: CAT (no optimizer) 52%

Average LLM cost

Value: $0.34

Baseline: AdbGPT $1.07

Average time per test

Value: 2.65 min

Baseline: Seq2Act 5.89 min

Real-world runs

Value: 6,300 automated runs
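The cost figures can be cross-checked with a few lines of arithmetic. The per-test costs are the reported ones. Multiplying by the 2,010-example testing split is an assumption, since this summary does not say which task count the stated $1,467 saving refers to, but the numbers line up under that reading.

```python
ADBGPT_COST = 1.07  # reported average LLM cost per test, USD
CAT_COST = 0.34     # reported average LLM cost per test, USD
TEST_TASKS = 2010   # reported testing split size; applying it here is an assumption

saving_per_test = ADBGPT_COST - CAT_COST
relative_saving = saving_per_test / ADBGPT_COST
total_saving = saving_per_test * TEST_TASKS

print(f"saving per test: ${saving_per_test:.2f}")             # $0.73
print(f"relative saving: {relative_saving:.0%}")              # 68%
print(f"total over {TEST_TASKS} tests: ${total_saving:,.0f}")  # $1,467
```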

Who Should Care

What To Try In 7 Days

Index past test scripts and build T5 embeddings for retrieval.

Prompt an off-the-shelf LLM with one retrieved example to generate action sequences.

Use a cheap ML matcher for element mapping and call LLM only when similarity is below a threshold.
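The third step, a cheap matcher with an LLM fallback below a similarity threshold, might look like the sketch below. Everything here is illustrative: `difflib.SequenceMatcher` stands in for a learned matcher, the 0.8 threshold is an arbitrary placeholder to be tuned on your own data, and `fake_llm` is a hypothetical stub where a real LLM call would go.

```python
from difflib import SequenceMatcher
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; an assumed stand-in for an ML matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_action_to_element(
    target: str,
    element_labels: list[str],
    llm_fallback: Callable[[str, list[str]], str],
    threshold: float = 0.8,  # placeholder value, not from the paper
) -> tuple[str, str]:
    """Return (chosen label, source), where source is "matcher" or "llm"."""
    best = max(element_labels, key=lambda label: similarity(target, label))
    if similarity(target, best) >= threshold:
        return best, "matcher"  # confident match: no LLM call, no LLM cost
    return llm_fallback(target, element_labels), "llm"


def fake_llm(target: str, labels: list[str]) -> str:
    """Hypothetical stub; a real system would prompt an LLM with the candidates."""
    return labels[-1]


labels = ["Chats", "Contacts", "Discover", "Me"]
print(map_action_to_element("contacts", labels, fake_llm))
print(map_action_to_element("my profile page", labels, fake_llm))
```

Logging the `source` field per step gives a direct measure of how often the LLM is actually invoked, which is the quantity that drives per-test cost.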

Agent Features

Memory

  • retrieval memory (indexed past usages)

Planning

  • task decomposition

Tool Use

  • RAG retrieval
  • LLM as optimizer

Frameworks

  • hybrid ML + LLM

Optimization Features

Token Efficiency

  • one-shot context reduces tokens vs N-shot

Infra Optimization

  • option to replace ChatGPT with open models (LLaMA70B tested)

System Optimization

  • hybrid ML primary + LLM secondary to lower cost

Inference Optimization

  • call LLMs only for low-similarity cases
  • use top-1 retrieval to shrink prompt size
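To illustrate why top-1 retrieval keeps prompts small, here is a minimal sketch of a one-shot prompt builder. The prompt wording, function name, and example actions are assumptions made for illustration and are not the prompt used in the paper. The point is only that the prompt carries exactly one retrieved example regardless of how large the retrieval index is.

```python
def build_one_shot_prompt(task: str, example_task: str, example_actions: list[str]) -> str:
    """Assemble a prompt containing a single retrieved example and the new task."""
    steps = "\n".join(
        f"{i}. {action}" for i, action in enumerate(example_actions, start=1)
    )
    return (
        "Decompose the UI test task into a numbered list of actions.\n\n"
        f"Example task: {example_task}\n"
        f"Example actions:\n{steps}\n\n"
        f"Task: {task}\n"
        "Actions:\n"
    )


# Hypothetical retrieved example and new task.
prompt = build_one_shot_prompt(
    task="send a voice message to a friend",
    example_task="send a text message to a friend",
    example_actions=["tap Chats", "tap contact", "type message", "tap Send"],
)
print(prompt)
```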

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Evaluation is limited to WeChat; generality to other apps is claimed but not fully measured.
  • Relies on ChatGPT (gpt-4) in experiments; open models performed worse in tests.
  • Long UI view hierarchies need simplification; visual UI understanding remains weak.
  • No public code or dataset release; reproducing results may be hard for outsiders.

When Not To Use

  • When you cannot send UI data to an external LLM service for privacy or licensing reasons.
  • When your app's UI relies on heavy visual/layout context that text view hierarchies cannot capture.
  • When per-test LLM cost must be essentially zero and you cannot tolerate any third-party model calls.

Failure Modes

  • LLM hallucination leading to invalid action sequences.
  • Instruction forgetting and format errors when many examples are included in the prompt.
  • ML matcher false positives on similar elements when semantic gap is large.

Core Entities

Models

  • ChatGPT (gpt-4)
  • LLaMA70B
  • T5 encoder

Metrics

  • completion rate
  • average LLM cost
  • average time per test

Datasets

  • WeChat testing dataset (39,981 task descriptions)
  • Retrieval dataset (37,971 examples)
  • Testing split (2,010 examples)