Use learned evaluators (VLM+LM) to judge and improve web and device-control agents without extra labels

April 9, 2024 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

3

Authors

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

Links

Abstract / PDF

Why It Matters For Business

Automated evaluators let you measure and improve GUI agents at scale without costly human labels or hand-coded test functions, enabling faster development and safer deployment of web and device automation.

Summary TLDR

The paper builds neural, domain-general evaluators that judge whether an agent completed a GUI task, given the instruction, the agent's actions, and screenshots. Two designs are tested: end-to-end vision-language models (GPT-4V or QWen-VL-chat) and a modular caption-then-reason pipeline (a captioner followed by a language model). Evaluators agree with oracle or human judgments 68–93% of the time, depending on model and dataset. Using these evaluators as rewards or filters improves agents: up to +29% relative via Reflexion (inference-time retry) on WebArena and about +73–75% relative via filtered behavior cloning (training-time filtering) on device-control tasks, all without extra human labels.
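The caption-then-reason design can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `Trajectory`, `evaluate_modular`, the prompt wording, the "success"/"failure" reply convention, and the choice to caption only the final screenshot are all assumptions made for the example. The `captioner` and `reasoner` arguments stand in for whatever captioning VLM and language model you call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical container for one agent rollout."""
    instruction: str
    actions: List[str]        # each action already rendered as text
    screenshots: List[bytes]  # raw image bytes, one per step


def evaluate_modular(
    traj: Trajectory,
    captioner: Callable[[bytes], str],  # assumed: image bytes -> detailed caption
    reasoner: Callable[[str], str],     # assumed: prompt -> reply starting with a verdict
) -> bool:
    """Caption the final screen, then ask a text-only model for a verdict."""
    # Step 1: turn the last screenshot into text (assumption: final state only).
    caption = captioner(traj.screenshots[-1])

    # Step 2: let the language model reason over instruction, actions, and caption.
    prompt = (
        f"Instruction: {traj.instruction}\n"
        f"Actions: {'; '.join(traj.actions)}\n"
        f"Final screen: {caption}\n"
        "Answer with 'success' or 'failure', then explain."
    )
    reply = reasoner(prompt)

    # Step 3: parse the verdict (assumed reply convention).
    return reply.strip().lower().startswith("success")
```

Keeping the captioner and reasoner as plain callables makes it easy to swap an open-weight captioner for a proprietary one and compare agreement rates on the same rollouts.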

Problem Statement

Evaluating and improving digital agents (web navigation and phone control) usually needs handcrafted test functions or human labels. The paper asks: can a learned, domain-general model automatically judge trajectories and then be used to refine agents, without extra expert demonstrations or bespoke evaluators?

Main Contribution

Design and compare two neural evaluator families: end-to-end VLM and modular caption-then-reason.

Collect a small screenshot→detailed-caption dataset (1,263 examples) and fine-tune an open captioner (QWen-VL-chat).

Show evaluators match oracle/human judgments with 68–93% accuracy across WebArena and Android benchmarks.

Demonstrate evaluator-driven refinement: using evaluators in Reflexion yields up to 29% relative gain; filtered behavior cloning yields ~73–75% relative gains on device-control tasks.

Open-source code and data to reproduce evaluator training and refinement experiments.

Key Findings

Modular evaluator (captioner + Mixtral) matched human/oracle judgments with high accuracy on Android

Numbers: Android agreement 92.9% (Captioner + Mixtral)

Proprietary multimodal model (GPT-4V) performs strongly end-to-end

Numbers: WebArena 80.6% / Android 90.6% accuracy

Evaluators enable large relative gains when used for agent refinement

Numbers: Reflexion improvement up to +29% relative; filtered BC +73–75% relative

Action-matching metrics can mislead on end-to-end success

Numbers: Action-matching Kendall τ = 66.7% vs evaluators' τ = 100% with human judges

Evaluator failures are dominated by reasoning and caption gaps

Numbers: Reasoning errors in 50–70% of cases; caption info loss ≈10% (modular)

Results

Accuracy (WebArena)

Value: 80.6% (GPT-4V), 68.0% (QWen-VL end-to-end), 74.4% (Captioner+Mixtral), 82.1% (Captioner+GPT-4)

Baseline: WebArena oracle evaluator (ground truth test cases)

Accuracy (Android)

Value: 90.6% (GPT-4V), 70.2% (QWen-VL end-to-end), 92.9% (Captioner+Mixtral), 89.8% (Captioner+GPT-4)

Baseline: Human judgments of trajectory success

Refinement via Reflexion (relative improvement)

Value: Up to +29% relative improvement

Baseline: GPT-4 WebArena agent baseline success rate

Filtered BC gains (iOS)

Value: Baseline 8/52 (15%) → Filtered BC 14/52 (27%; +75% relative)

Baseline: CogAgent on iOS test set (52 tasks)

Filtered BC gains (Android)

Value: Baseline 15 successes → Filtered BC 26 ±0.8 (+73% relative)

Baseline: Auto-UI-base on Android test set (96 tasks)
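The relative gains quoted for filtered BC follow directly from the success counts above. A short check of the arithmetic, where `relative_gain` is a helper name introduced only for this example:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


ios = relative_gain(8, 14)       # iOS: 8 -> 14 successes out of 52 tasks
android = relative_gain(15, 26)  # Android: 15 -> 26 successes out of 96 tasks
print(f"iOS: {ios:+.0%}, Android: {android:+.0%}")
# prints "iOS: +75%, Android: +73%"
```

Relative gains look large here partly because the baselines are low, so it helps to read them alongside the absolute counts.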

Kendall correlation with human judges

Value: Evaluators: 100% Kendall τ; Action matching: 66.7%

Baseline: Action matching (reference-based metric)
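For readers unfamiliar with Kendall τ, the sketch below computes it for two rankings without ties. The rank lists are made up for illustration and are not the paper's data. They show that with four ranked items, a metric that swaps a single pair relative to the human ranking scores τ ≈ 0.667, while a metric that preserves the order scores 1.0.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of four agents (1 = best); NOT taken from the paper.
human_rank = [1, 2, 3, 4]
metric_same_order = [1, 2, 3, 4]  # agrees with humans on every pair
metric_one_swap = [1, 2, 4, 3]    # disagrees on exactly one pair

print(kendall_tau(human_rank, metric_same_order))          # 1.0
print(round(kendall_tau(human_rank, metric_one_swap), 3))  # 0.667
```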

What To Try In 7 Days

Run an existing agent's rollouts through a Captioner+LM evaluator to estimate real task success rather than action-matching.

Use the evaluator to filter training examples (keep high-reward steps) and fine-tune a model via filtered behavior cloning.

Integrate an evaluator into a short Reflexion loop (1–3 retries) to catch recoverable failures at inference time.

Agent Features

Memory

  • short-term trajectory history (per-trajectory or per-step input)
  • actor memory in Reflexion (verbal reflections stored)

Planning

  • inference-time reflection (Reflexion) to retry and revise plans
  • multi-round retry loops (up to 3 rounds)
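An evaluator-gated retry loop in the spirit of Reflexion might look like the sketch below. It is an assumption-laden outline rather than the paper's code: `reflexion_loop`, `run_agent`, `evaluate`, and `reflect` are hypothetical names, and the trajectory type is left open. The evaluator's verdict decides whether to stop or to store a verbal reflection and try again, up to three rounds by default.

```python
from typing import Any, Callable, List, Tuple


def reflexion_loop(
    task: str,
    run_agent: Callable[[str, List[str]], Any],  # assumed: (task, reflections) -> trajectory
    evaluate: Callable[[str, Any], bool],        # assumed: learned evaluator verdict
    reflect: Callable[[str, Any], str],          # assumed: verbal reflection on a failure
    max_rounds: int = 3,
) -> Tuple[Any, bool]:
    """Retry the task until the evaluator accepts it or rounds run out."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    reflections: List[str] = []  # actor memory carried across rounds
    trajectory = None
    for _ in range(max_rounds):
        trajectory = run_agent(task, reflections)
        if evaluate(task, trajectory):
            return trajectory, True
        # Evaluator judged the attempt a failure: store a reflection and retry.
        reflections.append(reflect(task, trajectory))
    return trajectory, False
```

Note that a false-negative verdict sends an already successful run back for another attempt, so evaluator accuracy bounds how much this loop can help.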

Tool Use

  • use of emulators (Android/Xcode) for execution
  • OCR (EasyOCR) to augment captioning

Frameworks

  • Reflexion (verbal RL-style refinement)
  • filtered behavior cloning (filtered BC)
  • LoRA

Is Agentic

true

Architectures

  • vision-language model (VLM) + language model (LM)
  • LLM-based actor for WebArena (GPT-4 DOM grounded)
  • policy models: CogAgent, Auto-UI

Optimization Features

Model Optimization

  • fine-tuned QWen-VL captioner
  • LoRA

Training Optimization

  • filtered behavior cloning to keep high-reward state-action pairs
  • self-training baseline (unfiltered fine-tuning) for comparison
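The filtering step of filtered behavior cloning can be sketched as below. The `reward` callable, the 0.5 threshold, and the text-only `(instruction, state, action)` tuple are assumptions chosen for illustration. In practice the reward would come from the learned evaluator, and the kept examples would feed a standard supervised fine-tuning run.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical simplified example: (instruction, state description, action).
Example = Tuple[str, str, str]


def filter_for_bc(
    examples: Iterable[Example],
    reward: Callable[[str, str, str], float],  # assumed evaluator-derived score
    threshold: float = 0.5,                    # assumed cut-off
) -> List[Example]:
    """Keep only state-action pairs the evaluator scores at or above threshold."""
    return [ex for ex in examples if reward(*ex) >= threshold]


# Toy usage with a stand-in reward function.
rollout = [
    ("open wifi settings", "home screen", "tap Settings"),
    ("open wifi settings", "home screen", "tap Camera"),
]
kept = filter_for_bc(rollout, lambda instr, state, act: 1.0 if "Settings" in act else 0.0)
print(kept)  # only the "tap Settings" example remains
```

Skipping the filter and fine-tuning on all rollouts corresponds to the unfiltered self-training baseline listed above.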

Inference Optimization

  • Reflexion inference-time retry using learned evaluator as reward

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluators are imperfect: reasoning errors and caption omissions are common and limit final accuracy.
  • High-performing end-to-end evaluators rely on proprietary models (GPT-4V) and API costs.
  • iOS experiments are constrained by slow physical/macOS emulation and small data.
  • Action representation loses spatial detail when converting pixel-localized actions to text.

When Not To Use

  • When you need provable, high-assurance correctness for safety-critical automation.
  • When low-latency, low-cost single-call inference is mandatory and API costs are prohibitive.
  • If your tasks require precise pixel-level action grounding that text captions cannot capture.

Failure Modes

  • Captioner misses critical visual details, leading to wrong judgments.
  • LM produces convincing but incorrect reasoning that masks evaluator errors.
  • False-negative evaluator calls force successful agent runs to retry and then fail.
  • Reference-based action-matching metrics give inflated performance when demonstrations are flawed.

Core Entities

Models

  • GPT-4V
  • GPT-4
  • QWen-VL-chat (fine-tuned captioner)
  • Mixtral (Mixtral-8x7B-Instruct-v0.1)
  • CogAgent
  • Auto-UI (base and large)
  • GPT-4-based WebArena agent

Metrics

  • Accuracy
  • Task success rate (trajectory-level)
  • Action matching score (reference-based)
  • Kendall correlation with human judges

Datasets

  • WebArena
  • Android-in-the-Wild (AitW)
  • iOS curated tasks (132)
  • Screenshot caption dataset (1,263 images)

Benchmarks

  • WebArena
  • Android-in-the-Wild

Context Entities

Models

  • Qwen-VL-chat (open-weight VLM)
  • Mixtral-8x7B-Instruct-v0.1
  • gpt-4-1106-vision-preview (OpenAI)

Metrics

  • Per-step ternary labels (goal-reached, progress, detrimental)

Datasets

  • WebScreenshot (Dwyer, 2020)
  • Mind2Web
  • AitW train set

Benchmarks

  • VisualWebArena / WebArena (related work)