A reproducible Windows benchmark and baseline agent showing zero-shot multimodal agents still far from humans

September 12, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui

Links

Abstract / PDF

Why It Matters For Business

WindowsAgentArena lets teams test desktop automation agents in real Windows apps and collect training/evaluation data quickly using cloud parallel runs, shortening iteration time and revealing real gaps between current models and human performance.

Summary TLDR

WindowsAgentArena is an open-source benchmark that runs agents inside a real Windows 11 VM to test multi-step, multimodal desktop tasks. It ships 154 Windows tasks across apps (Office, browsers, File Explorer, VLC, VSCode) and supports fast parallel evaluation on Azure. The paper also releases a multimodal baseline agent, Navi. The best zero-shot Navi configuration reaches 19.5% task success on WindowsAgentArena, versus 74.5% for a human. Performance improves when precise UI Automation (UIA) accessibility markers are combined with pixel-based Set-of-Marks. The suite and code can be run locally or at scale for faster iteration and data generation.

Problem Statement

Benchmarks for agents either focus on narrow domains (text-only, web-only, mobile) or run too slowly because realistic multi-step tasks must execute in real operating systems. There is a need for a reproducible, scalable Windows benchmark that exposes agents to real apps, realistic screen content, and fast parallel evaluation to speed research and data generation.

Main Contribution

WindowsAgentArena: 154 reproducible multi-step Windows tasks across common apps and web domains with execution-based evaluators.

A scalable deployment design using Docker + Windows VMs and Azure parallelization that can run full benchmark evaluations in about 20 minutes.

Navi: an open multimodal baseline agent with variants that combine UI Automation tree, OCR, pixel detectors, and Set-of-Marks prompting.

Open-source release: code, tasks, and a baseline agent to support reproducible evaluation and data generation.

Key Findings

Zero-shot multimodal baseline (Navi) reaches 19.5% task success on WindowsAgentArena.

Numbers: 19.5% success (Table 4, best config)

Human performance on the same tasks is 74.5% success.

Numbers: 74.5% human success (Table 4 & A.5)

Adding high-quality UI accessibility markers (UIA) plus pixel-based Set-of-Marks markedly improves agent success.

Numbers: UIA addition boosted Omniparser performance by 57% with GPT-4V-1106
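One way to picture combining UIA markers with pixel-based detections is as a merge of two sets of bounding boxes. The sketch below is an illustration only, not the paper's implementation: the `Box` type, the `merge_marks` helper, and the 0.5 IoU threshold are all assumptions. It keeps every UIA box and adds a pixel-detected box only when it does not substantially overlap a UIA box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned screen rectangle (hypothetical type for this sketch)."""
    left: float
    top: float
    right: float
    bottom: float
    source: str  # "uia" or "pixel"


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.right, b.right) - max(a.left, b.left)
    inter_h = min(a.bottom, b.bottom) - max(a.top, b.top)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a.right - a.left) * (a.bottom - a.top)
    area_b = (b.right - b.left) * (b.bottom - b.top)
    return inter / (area_a + area_b - inter)


def merge_marks(uia_boxes, pixel_boxes, iou_threshold=0.5):
    """Keep all UIA boxes; add pixel boxes that no UIA box already covers.

    The threshold is an assumed value chosen for illustration.
    """
    merged = list(uia_boxes)
    for candidate in pixel_boxes:
        if all(iou(candidate, u) < iou_threshold for u in uia_boxes):
            merged.append(candidate)
    return merged


uia = [Box(0, 0, 10, 10, "uia")]
pixel = [Box(1, 1, 10, 10, "pixel"), Box(20, 20, 30, 30, "pixel")]
marks = merge_marks(uia, pixel)
print([m.source for m in marks])
```

Preferring UIA boxes on overlap reflects the finding that accessibility markers are the more precise signal; a different policy may suit other detectors.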

Multimodal Set-of-Marks input improved Mind2Web step success to 45.2%, beating prior SeeAct results (38.3%).

Numbers45.2% step success vs 38.3% (Table 5)

Parallel Azure deployment reduces wall-clock evaluation time to roughly 20 minutes or less for most configurations.

Numbers: Full runs usually <20 min; Omniparser runs slower (Table 11)
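The speedup comes from the tasks being independent, so they can be split across workers. The sketch below shows a simple round-robin split and a rough wall-clock estimate. The worker count (40) and the per-task duration (5 minutes) are illustrative assumptions, not figures from the paper, and the helper names are hypothetical.

```python
import math


def shard_tasks(task_ids, num_workers):
    """Round-robin split of task IDs into one list per worker."""
    return [task_ids[i::num_workers] for i in range(num_workers)]


def estimated_wall_clock(num_tasks, num_workers, minutes_per_task):
    """Rough estimate: the busiest worker runs its tasks back to back."""
    return math.ceil(num_tasks / num_workers) * minutes_per_task


task_ids = list(range(154))          # 154 benchmark tasks
shards = shard_tasks(task_ids, 40)   # 40 workers is an assumed value
print(max(len(s) for s in shards))   # tasks on the busiest worker
print(estimated_wall_clock(154, 40, 5))
```

The estimate ignores VM startup and differences in task length, both of which would shift the real number.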

Results

WindowsAgentArena task success (best Navi)

Value: 19.5%

Baseline: Human 74.5%

Human task success

Value: 74.5%

Mind2Web step success (best multimodal SoM)

Value: 45.2%

Baseline: SeeAct 38.3%

End-to-end evaluation time (median)

Value: ≈16–21 min typical; Omniparser slower (39–82 min)

Baseline: serial local runs often take hours to days

Who Should Care

What To Try In 7 Days

Run the repo locally or in Azure to reproduce a subset of tasks and baseline results.

Evaluate your existing agent using UIA + pixel SoM inputs and compare success vs 19.5%.

Collect failed trajectories from 50–100 tasks to prioritize perception fixes (SoM/UIA) or build small fine-tuning datasets.
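A first pass over collected trajectories can be as simple as counting failures by cause. The sketch below assumes each trajectory has been hand-labelled with a `failure_mode` string; the record layout and the label names are hypothetical and not part of the benchmark's output format.

```python
from collections import Counter


def tally_failures(trajectories):
    """Count failed trajectories per hand-assigned failure label."""
    return Counter(
        t["failure_mode"] for t in trajectories if not t["success"]
    )


# Hypothetical records; the labels echo the failure modes listed later.
trajectories = [
    {"task": "t1", "success": False, "failure_mode": "som_bounding_box"},
    {"task": "t2", "success": True, "failure_mode": None},
    {"task": "t3", "success": False, "failure_mode": "wrong_visual_id"},
    {"task": "t4", "success": False, "failure_mode": "som_bounding_box"},
]

counts = tally_failures(trajectories)
for mode, n in counts.most_common():
    print(f"{mode}: {n}")
```

Sorting by count gives a quick priority list: the most frequent label is the first candidate for a perception fix or a targeted fine-tuning set.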

Agent Features

Memory

  • short-term textual memory block
  • clipboard as temporary storage

Planning

  • chain-of-thought prompting (reasoning in steps)
  • explicit stepwise python action generation

Tool Use

  • Computer class API (mouse/keyboard/OS functions)
  • pyautogui wrapper
  • clipboard and window manager controls
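To make the "prompted Python action" idea concrete, the sketch below runs a model-generated code string against a stand-in object. `RecordingComputer`, its `click` and `type_text` methods, and `run_action` are hypothetical names for this illustration and are not the Navi Computer class API. The stand-in only records calls instead of driving a real mouse or keyboard.

```python
class RecordingComputer:
    """Stand-in for an OS control API; it logs actions instead of doing them."""

    def __init__(self):
        self.log = []

    def click(self, x, y):
        self.log.append(("click", x, y))

    def type_text(self, text):
        self.log.append(("type", text))


def run_action(action_code, computer):
    """Execute one generated action snippet with only `computer` in scope.

    Emptying __builtins__ limits accidental calls but is NOT a security
    sandbox; real isolation would come from running inside a VM.
    """
    exec(action_code, {"__builtins__": {}, "computer": computer})


# Hypothetical model output for a single step.
generated = "computer.click(10, 20)\ncomputer.type_text('hello')"

computer = RecordingComputer()
run_action(generated, computer)
print(computer.log)
```

Swapping the recorder for a wrapper around a library such as pyautogui would turn the same loop into real input events, which is why execution belongs inside the isolated VM.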

Frameworks

  • Set-of-Marks prompting
  • UIA accessibility tree parsing
  • Omniparser and pixel detectors
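Set-of-Marks prompting overlays numbered marks on detected screen elements so the model can refer to an element by ID rather than by raw coordinates. The sketch below covers only the bookkeeping side: assigning IDs to boxes and resolving a chosen ID back to a click point. The function names and the `(left, top, right, bottom)` tuple layout are assumptions for illustration.

```python
def assign_marks(boxes):
    """Give each (left, top, right, bottom) box a 1-based mark ID."""
    return {mark_id: box for mark_id, box in enumerate(boxes, start=1)}


def mark_center(marks, mark_id):
    """Resolve a mark ID chosen by the model to the box's center pixel."""
    left, top, right, bottom = marks[mark_id]
    return ((left + right) // 2, (top + bottom) // 2)


# Two hypothetical detected elements, in pixel coordinates.
marks = assign_marks([(0, 0, 100, 40), (200, 100, 260, 120)])
print(mark_center(marks, 2))  # click target for mark 2
```

Because the click lands at the center of the stored box, an imprecise box translates directly into a wrong click.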

Is Agentic

true

Architectures

  • LLM + Visual Language Model (VLM) stack
  • prompted code-output (python) action policy

Collaboration

  • supports human-in-the-loop decisions (discussed as an option)

Optimization Features

Infra Optimization

  • Azure ML job parallelization; pick VMs with nested virtualization
  • QEMU/KVM snapshot reuse to reduce setup time

System Optimization

  • Dockerized Windows VM for reproducible runs
  • use of UIA reduces reliance on expensive pixel search

Training Optimization

  • RL

Inference Optimization

  • parallelize independent tasks across cloud workers

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Top agent results rely on proprietary vision/language models and internal detectors; open-source variants perform worse.
  • Windows 11 VM snapshot cannot be distributed due to licensing; setup requires following repo scripts and obtaining a trial image.
  • Single human evaluator for the human baseline; broader human studies are not provided.
  • Evaluation focuses on Windows desktop workflows; does not cover mobile or server-only tasks.

When Not To Use

  • If your product targets non-Windows OS environments (use AndroidWorld/OSWorld instead).
  • When you need certified safety audits for autonomous actions without human oversight.
  • If you require end-to-end real-time latency guarantees for production deployments.

Failure Modes

  • Incorrect or imprecise Set-of-Marks (SoM) bounding boxes cause wrong element clicks.
  • Visual–language misalignment: the model's text describes the correct action but it selects the wrong visual ID.
  • Long-context planning hallucinations in smaller LLMs lead to invalid action sequences.
  • Slow or CPU-bound proprietary visual parsers (e.g., Omniparser) extend runtime and reduce throughput.

Core Entities

Models

  • GPT-4V-1106
  • GPT-4o
  • GPT-4o-mini
  • Phi3-V
  • Phi3

Metrics

  • Task success rate
  • Accuracy
  • Operation F1
  • Step success rate
  • Evaluation time (median run time)

Datasets

  • WindowsAgentArena (154 tasks)
  • Mind2Web (processed)

Benchmarks

  • WindowsAgentArena
  • OSWorld
  • Mind2Web
  • AndroidWorld

Context Entities

Models

  • SeeAct

Metrics

  • Human success baseline

Benchmarks

  • MiniWoB++
  • WebArena
  • VisualWebArena
  • WorkArena
  • MMInA