Use learned evaluators (VLM+LM) to judge and improve web and device-control agents without extra labels

April 9, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach is practical: open-weight modular evaluators work well (92.9% agreement on Android with Captioner + Mixtral), and GPT-4-based evaluators give the strongest accuracy on WebArena (80.6–82.1%). Evaluators still require engineering to reduce hallucination and to scale emulation.

Citations: 3

Evidence Strength: 0.85

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 65%

Novelty: 60%

Authors

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

Why It Matters For Business

Automated evaluators let you measure and improve GUI agents at scale without costly human labels or hand-coded test functions, enabling faster development and safer deployment of web and device automation.

Summary TLDR

The paper builds neural, domain-general evaluators that judge whether an agent completed a GUI task, given the instruction, the agent's actions, and screenshots. Two designs are tested: end-to-end vision-language models (GPT-4V or QWen-VL-chat) and a modular caption-then-reason pipeline (captioner + language model). Evaluators agree with oracle or human judgments 68–93% of the time, depending on model and dataset. Using these evaluators as rewards or filters improves agents: up to +29% relative via Reflexion (inference-time retry) and about +73–75% relative via filtered behavior cloning (training-time filtering) on device-control tasks, all without extra human labels.

Problem Statement

Evaluating and improving digital agents (web navigation and phone control) usually needs handcrafted test functions or human labels. The paper asks: can a learned, domain-general model automatically judge trajectories and then be used to refine agents, without extra expert demonstrations or bespoke evaluators?

Main Contribution

Design and compare two neural evaluator families: end-to-end VLM and modular caption-then-reason.

Collect a small screenshot→detailed-caption dataset (1,263 examples) and fine-tune an open captioner (QWen-VL-chat).
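
To make the modular caption-then-reason design concrete, here is a minimal sketch of how such an evaluator could be wired together. The `Trajectory` class, the `Captioner` and `Reasoner` callable types, the prompt wording, and the choice to caption every screenshot are all illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    instruction: str
    actions: List[str]
    screenshots: List[bytes]  # raw image bytes, one per step (assumed layout)


# Hypothetical interfaces: any callables with these shapes will do.
Captioner = Callable[[bytes], str]  # screenshot -> detailed caption
Reasoner = Callable[[str], str]     # text prompt -> free-text verdict


def build_prompt(traj: Trajectory, captions: List[str]) -> str:
    """Turn the instruction, actions, and captions into one text prompt."""
    steps = "\n".join(
        f"Step {i + 1}: action={action!r}; screen={caption}"
        for i, (action, caption) in enumerate(zip(traj.actions, captions))
    )
    return (
        f"Instruction: {traj.instruction}\n{steps}\n"
        "Did the agent complete the task? Answer 'success' or 'failure'."
    )


def evaluate(traj: Trajectory, captioner: Captioner, reasoner: Reasoner) -> bool:
    """Caption each screenshot, then let the language model judge the result."""
    captions = [captioner(shot) for shot in traj.screenshots]
    answer = reasoner(build_prompt(traj, captions))
    # Assumed output convention: the verdict starts with 'success' or 'failure'.
    return answer.strip().lower().startswith("success")
```

Keeping the captioner and reasoner behind plain callables means the same `evaluate` function can be exercised with stubs in tests and later pointed at a fine-tuned captioner and a local or hosted language model.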

Key Findings

Modular evaluator (captioner + Mixtral) matched human/oracle judgments with high accuracy on Android

Numbers: Android agreement 92.9% (Captioner + Mixtral)

Practical Use: You can build a locally-run, open-weight evaluation pipeline that reliably matches human judgments on Android-type GUI tasks; use it to filter or score trajectories.

Evidence Ref: Table 1; Section 4.1

Proprietary multimodal model (GPT-4V) performs strongly end-to-end

Numbers: WebArena 80.6% / Android 90.6% accuracy

Practical Use: If budget and API access allow, GPT-4V gives strong single-call evaluation without building a captioner, at higher cost.

Evidence Ref: Table 1; Section 4.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 80.6% (GPT-4V), 68.0% (QWen-VL end-to-end), 74.4% (Captioner+Mixtral), 82.1% (Captioner+GPT-4) | WebArena oracle evaluator (ground truth test cases) | not stated | WebArena (trajectory-level) | Table 1 reports evaluator agreement percentages on WebArena. | Section 4.1; Table 1 |
| Accuracy | 90.6% (GPT-4V), 70.2% (QWen-VL end-to-end), 92.9% (Captioner+Mixtral), 89.8% (Captioner+GPT-4) | Human judgments of trajectory success | not stated | Android-in-the-Wild subset (120 tasks) | Table 1 and Section 4.1 show agreement with human labels on Android. | Section 4.1; Table 1 |
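
The accuracy figures above are agreement rates between evaluator verdicts and a reference (oracle test cases on WebArena, human labels on Android). A minimal sketch of that computation, assuming binary success/failure verdicts per trajectory, follows; the function name `agreement_rate` is illustrative.

```python
from typing import Sequence


def agreement_rate(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of trajectories where the evaluator matches the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("need at least one trajectory")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)
```

For example, if the evaluator and the reference agree on two of four trajectories, `agreement_rate` returns 0.5.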

What To Try In 7 Days

Run an existing agent's rollouts through a Captioner+LM evaluator to estimate real task success rather than action-matching.

Use the evaluator to filter training examples (keep high-reward steps) and fine-tune a model via filtered behavior cloning.

Integrate an evaluator into a short Reflexion loop (1–3 retries) to catch recoverable failures at inference time.
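
The Reflexion-style retry loop in the last item can be sketched as below, under stated assumptions: the `actor`, `evaluator`, and `reflector` callables are hypothetical stand-ins for an agent rollout, a learned evaluator, and a reflection-writing model, and the trajectory is treated as an opaque string.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for illustration only.
Actor = Callable[[str, List[str]], str]  # (task, reflections so far) -> trajectory
Evaluator = Callable[[str, str], bool]   # (task, trajectory) -> judged success
Reflector = Callable[[str, str], str]    # (task, failed trajectory) -> reflection


def run_with_retries(
    task: str,
    actor: Actor,
    evaluator: Evaluator,
    reflector: Reflector,
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Retry a task until the evaluator accepts it or the round budget runs out.

    Returns (last trajectory, evaluator verdict, rounds used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    reflections: List[str] = []
    trajectory = ""
    for round_idx in range(1, max_rounds + 1):
        trajectory = actor(task, reflections)
        if evaluator(task, trajectory):
            return trajectory, True, round_idx
        if round_idx < max_rounds:
            # Store a verbal reflection so the next attempt can condition on it.
            reflections.append(reflector(task, trajectory))
    return trajectory, False, max_rounds
```

Because the evaluator's verdict is the only stopping signal, a false positive ends the loop early with a failed trajectory, so the evaluator's accuracy bounds how much the retries can help.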

Agent Features

Memory: short-term trajectory history (per-trajectory or per-step input); actor memory in Reflexion (verbal reflections stored)
Planning: inference-time reflection (Reflexion) to retry and revise plans; multi-round retry loops (up to 3 rounds)
Tool Use: use of emulators (Android/Xcode) for execution; OCR (EasyOCR) to augment captioning
Frameworks: Reflexion (verbal RL-style refinement); filtered behavior cloning (filtered BC); LoRA
Is Agentic

Yes

Architectures
vision-language model (VLM) + language model (LM); LLM-based actor for WebArena (GPT-4 DOM grounded); policy models: CogAgent, Auto-UI

Optimization Features

Model Optimization: fine-tuned QWen-VL captioner; LoRA
Training Optimization: filtered behavior cloning to keep high-reward state-action pairs; self-training baseline (unfiltered fine-tuning) for comparison
Inference Optimization: Reflexion inference-time retry using learned evaluator as reward
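
The data-selection step of filtered behavior cloning can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the `Step` record, the `reward_fn` callable (standing in for the learned evaluator), and the 0.5 threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Step:
    observation: str  # e.g. a screenshot reference or caption (assumed)
    action: str


def filter_for_cloning(
    steps: Iterable[Step],
    reward_fn: Callable[[Step], float],
    threshold: float = 0.5,  # hypothetical cut-off, tune for your evaluator
) -> List[Step]:
    """Keep only the state-action pairs the evaluator scores at or above threshold."""
    return [step for step in steps if reward_fn(step) >= threshold]
```

The surviving steps become the supervised fine-tuning set. Skipping the filter and training on all steps corresponds to the unfiltered self-training baseline listed above.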

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluators are imperfect: reasoning errors and caption omissions are common and limit final accuracy.

High-performing end-to-end evaluators rely on proprietary models (GPT-4V) and API costs.

When Not To Use

When you need provable, high-assurance correctness for safety-critical automation.

When low-latency, low-cost single-call inference is mandatory and API costs are prohibitive.

Failure Modes

Captioner misses critical visual details, leading to wrong judgments.

LM produces convincing but incorrect reasoning that masks evaluator errors.

Core Entities

Models

GPT-4V; GPT-4; QWen-VL-chat (fine-tuned captioner); Mixtral (Mixtral-8x7B-Instruct-v0.1); CogAgent; Auto-UI (base and large); GPT-4-based WebArena agent

Metrics

Accuracy; Task success rate (trajectory-level); Action matching score (reference-based); Kendall correlation with human judges
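
For readers unfamiliar with the Kendall correlation listed above, here is a minimal pure-Python sketch of the tau-a variant, which compares how often two score lists order each pair of items the same way. Which variant the paper uses is not stated here, so treat this choice as an assumption; tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two items to form a pair")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs
```

A value of 1.0 means the two rankings agree on every pair, and -1.0 means they disagree on every pair.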

Datasets

WebArena; Android-in-the-Wild (AitW); iOS curated tasks (132); Screenshot caption dataset (1,263 images)

Benchmarks

WebArena; Android-in-the-Wild

Context Entities

Models

Qwen-VL-chat (open-weight VLM); Mixtral-8x7B-Instruct-v0.1; gpt-4-1106-vision-preview (OpenAI)

Metrics

Per-step ternary labels (goal-reached, progress, detrimental)

Datasets

WebScreenshot (Dwyer, 2020); Mind2Web; AitW train set

Benchmarks

VisualWebArena / WebArena (related work)