A reproducible Windows benchmark and baseline agent showing zero-shot multimodal agents still far from humans

September 12, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible, and the baseline agent shows where effort should go (visual parsing, UIA integration). However, the zero-shot results are preliminary, and top performance relies on proprietary models.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WindowsAgentArena lets teams test desktop automation agents in real Windows apps and collect training/evaluation data quickly using cloud parallel runs, shortening iteration time and revealing real gaps between current models and human performance.

Who Should Care

Summary TLDR

WindowsAgentArena is an open-source benchmark that runs agents inside a real Windows 11 VM to test multi-step, multimodal desktop tasks. It ships 154 Windows tasks across apps (Office, browsers, file explorer, VLC, VSCode) and supports fast parallel evaluation on Azure. The paper also releases a multimodal baseline agent, Navi. Best zero-shot Navi reaches 19.5% task success on WindowsAgentArena versus 74.5% for a human; performance improves when precise UI accessibility markers (UIA) are combined with pixel-based Set-of-Marks. The suite and code are available to run locally or at scale for faster iteration and data generation.

Problem Statement

Benchmarks for agents either focus on narrow domains (text-only, web-only, mobile) or run too slowly because realistic multi-step tasks must execute in real operating systems. There is a need for a reproducible, scalable Windows benchmark that exposes agents to real apps, realistic screen content, and fast parallel evaluation to speed research and data generation.

Main Contribution

WindowsAgentArena: 154 reproducible multi-step Windows tasks across common apps and web domains with execution-based evaluators.

A scalable deployment design using Docker + Windows VMs and Azure parallelization that can run full benchmark evaluations in about 20 minutes.
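
The parallel design can be pictured as splitting the 154 tasks into independent shards, one per cloud worker. The following is a minimal sketch, assuming round-robin assignment; the function name `shard_tasks`, the task IDs, and the worker count are illustrative and not taken from the repository.

```python
from typing import List


def shard_tasks(task_ids: List[str], num_workers: int) -> List[List[str]]:
    """Split task IDs into num_workers shards using round-robin assignment."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    for index, task_id in enumerate(task_ids):
        shards[index % num_workers].append(task_id)
    return shards


# Hypothetical task IDs standing in for the 154 benchmark tasks.
tasks = [f"task_{i:03d}" for i in range(154)]
shards = shard_tasks(tasks, num_workers=40)  # worker count is an assumption
sizes = [len(s) for s in shards]
print(min(sizes), max(sizes))
```

Because the tasks are independent, wall-clock time under this scheme is bounded roughly by the slowest shard rather than by the sum of all tasks.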

Key Findings

Zero-shot multimodal baseline (Navi) reaches 19.5% task success on WindowsAgentArena.

Numbers: 19.5% success (Table 4, best config)

Practical Use: Expect current generalist VLM agents to solve only a minority of realistic desktop tasks; plan for fine-tuning, specialized modules, or human-in-loop workflows.

Evidence Ref: Table 4

Human performance on the same tasks is 74.5% success.

Numbers: 74.5% human success (Table 4 & A.5)

Practical Use: Benchmarks should target human-level workflows; use WindowsAgentArena to measure the remaining gap and prioritize improvements in perception and action accuracy.

Evidence Ref: Table 4; Appendix A.5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
WindowsAgentArena task success (best Navi) | 19.5% | Human 74.5% | -55.0pp | WindowsAgentArena (154 tasks) | Best Navi config (UIA + Omniparser + GPT-4V-1106) achieves 19.5% success | Table 4
Human task success | 74.5% | n/a | n/a | WindowsAgentArena (human participant) | Single human run across tasks reports 74.5% overall success | Table 4; Appendix A.5
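
The delta column is a difference in percentage points, not a relative change. A small worked check using the two figures from the table (the variable names are illustrative):

```python
navi_success = 19.5   # best Navi config, percent (Table 4)
human_success = 74.5  # human baseline, percent (Table 4; Appendix A.5)

# Absolute gap in percentage points (pp).
delta_pp = navi_success - human_success

# Share of the human success rate that the agent reaches.
relative_to_human = navi_success / human_success

print(f"{delta_pp:+.1f}pp")
print(f"{relative_to_human:.1%} of human success")
```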

What To Try In 7 Days

Run the repo locally or in Azure to reproduce a subset of tasks and baseline results.

Evaluate your existing agent using UIA + pixel SoM inputs and compare success vs 19.5%.

Collect failed trajectories from 50–100 tasks to prioritize perception fixes (SoM/UIA) or to build small fine-tuning datasets.
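
One way to score a run against the 19.5% baseline and triage the failures is sketched below. The record fields (`task_id`, `domain`, `success`) and the sample data are hypothetical; adapt them to whatever your evaluation logs actually contain.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

BASELINE_SUCCESS = 19.5  # best zero-shot Navi result reported in the paper, percent


@dataclass
class TaskResult:
    task_id: str
    domain: str      # e.g. "browser", "office" (hypothetical labels)
    success: bool


def success_rate(results: List[TaskResult]) -> float:
    """Return the percentage of tasks that succeeded."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.success for r in results) / len(results)


def failures_by_domain(results: List[TaskResult]) -> Counter:
    """Count failed tasks per domain to show where to focus fixes."""
    return Counter(r.domain for r in results if not r.success)


# Made-up results for illustration only.
results = [
    TaskResult("t1", "browser", True),
    TaskResult("t2", "browser", False),
    TaskResult("t3", "office", False),
    TaskResult("t4", "office", False),
]
rate = success_rate(results)
print(f"success: {rate:.1f}% (baseline {BASELINE_SUCCESS}%)")
print(failures_by_domain(results).most_common())
```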

Agent Features

Memory
short-term textual memory block
clipboard as temporary storage
Planning
chain-of-thought prompting (reasoning in steps)
explicit stepwise Python action generation
Tool Use
Computer class API (mouse/keyboard/OS functions)
pyautogui wrapper
clipboard and window manager controls
Frameworks
Set-of-Marks prompting
UIA accessibility tree parsing
Omniparser and pixel detectors
Is Agentic

Yes

Architectures
LLM + Visual Language Model (VLM) stack
prompted code-output (Python) action policy
Collaboration
supports human-in-the-loop decision (discussed as option)
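
The tool-use and architecture entries describe a policy that emits Python, which is then run against a `Computer` object. The sketch below shows that pattern only. The `Computer`, `Mouse`, and `Keyboard` classes and their method names are hypothetical stand-ins that record actions instead of driving a real desktop; they are not the benchmark's actual API.

```python
from typing import List, Tuple


class Mouse:
    def __init__(self, log: List[Tuple]):
        self._log = log

    def click(self, x: int, y: int) -> None:
        self._log.append(("click", x, y))


class Keyboard:
    def __init__(self, log: List[Tuple]):
        self._log = log

    def write(self, text: str) -> None:
        self._log.append(("write", text))


class Computer:
    """Hypothetical facade; a real one might wrap pyautogui calls."""

    def __init__(self):
        self.log: List[Tuple] = []
        self.mouse = Mouse(self.log)
        self.keyboard = Keyboard(self.log)


def run_action(code: str, computer: Computer) -> None:
    """Execute model-emitted Python with only `computer` in scope.

    Note: exec on untrusted model output is unsafe outside a sandboxed VM.
    """
    exec(code, {"__builtins__": {}}, {"computer": computer})


# Example of code a model might emit for one step (made up).
model_output = 'computer.mouse.click(120, 340)\ncomputer.keyboard.write("hello")'
computer = Computer()
run_action(model_output, computer)
print(computer.log)
```

Recording actions in a log rather than executing them makes the policy's output easy to inspect before it is allowed to touch a live VM.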

Optimization Features

Infra Optimization
Azure ML job parallelization; pick VMs with nested virtualization
QEMU/KVM snapshot reuse to reduce setup time
System Optimization
Dockerized Windows VM for reproducible runs
use of UIA reduces reliance on expensive pixel search
Training Optimization
RL
Inference Optimization
parallelize independent tasks across cloud workers

Risks & Boundaries

Limitations

Top agent results rely on proprietary vision/language models and internal detectors; open-source variants perform worse.

Windows 11 VM snapshot cannot be distributed due to licensing; setup requires following repo scripts and obtaining a trial image.

When Not To Use

When your product targets non-Windows environments (use AndroidWorld or OSWorld instead).

When you need certified safety audits for autonomous actions without human oversight.

Failure Modes

Incorrect or imprecise Set-of-Marks (SoM) bounding boxes cause wrong element clicks.

Visual–language misalignment: model text describes correct action but selects wrong visual ID.
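
Both failure modes trace back to how a Set-of-Marks ID is turned into a click. The following is a minimal sketch, assuming each mark is an axis-aligned box in pixel coordinates and that the agent clicks the box center; the `Box` type, the IDs, and the coordinates are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> Tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)

    def contains(self, point: Tuple[int, int]) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_point(marks: Dict[int, Box], mark_id: int) -> Tuple[int, int]:
    """Resolve a mark ID chosen by the model to screen coordinates."""
    if mark_id not in marks:
        raise KeyError(f"unknown mark id {mark_id}")
    return marks[mark_id].center()


# True on-screen button versus two detected boxes for the same element.
true_button = Box(100, 100, 140, 120)
marks = {
    7: Box(100, 100, 140, 120),   # precise box
    8: Box(130, 90, 230, 130),    # box shifted and too wide
}

precise = click_point(marks, 7)
imprecise = click_point(marks, 8)
print(precise, true_button.contains(precise))
print(imprecise, true_button.contains(imprecise))
```

The second failure mode corresponds to the model passing the wrong `mark_id` even when every box is accurate, so tightening the boxes alone would not address it.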

Core Entities

Models

GPT-4V-1106, GPT-4o, GPT-4o-mini, Phi3-V, Phi3

Metrics

Task success rate, Accuracy, Operation F1, Step success rate, Evaluation time (median run time)

Datasets

WindowsAgentArena (154 tasks), Mind2Web (processed)

Benchmarks

WindowsAgentArena, OSWorld, Mind2Web, AndroidWorld

Context Entities

Models

SeeAct

Metrics

Human success baseline

Benchmarks

MiniWoB++, WebArena, VisualWebArena, WorkArena, MMInA