Overview
The benchmark is practical and reproducible. The baseline agent shows where effort should go (visual parsing and UIA integration), but the zero-shot results are preliminary and the top scores depend on proprietary models.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
WindowsAgentArena lets teams test desktop automation agents in real Windows apps and collect training/evaluation data quickly using cloud parallel runs, shortening iteration time and revealing real gaps between current models and human performance.
Summary TLDR
WindowsAgentArena is an open-source benchmark that runs agents inside a real Windows 11 VM to test multi-step, multimodal desktop tasks. It ships 154 Windows tasks across apps (Office, browsers, file explorer, VLC, VSCode) and supports fast parallel evaluation on Azure. The paper also releases a multimodal baseline agent, Navi. The best zero-shot Navi configuration reaches 19.5% task success on WindowsAgentArena versus 74.5% for a human; performance improves when precise UI Automation (UIA) accessibility markers are combined with pixel-based Set-of-Marks (SoM). The suite and code can be run locally or at scale for faster iteration and data generation.
Problem Statement
Benchmarks for agents either focus on narrow domains (text-only, web-only, mobile) or run too slowly because realistic multi-step tasks must execute in real operating systems. There is a need for a reproducible, scalable Windows benchmark that exposes agents to real apps, realistic screen content, and fast parallel evaluation to speed research and data generation.
Main Contribution
WindowsAgentArena: 154 reproducible multi-step Windows tasks across common apps and web domains with execution-based evaluators.
A scalable deployment design using Docker + Windows VMs and Azure parallelization that can run full benchmark evaluations in about 20 minutes.
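The parallelization idea can be illustrated with a minimal sketch that splits the 154 task IDs evenly across workers, so each worker VM handles only a few tasks. This is not the repository's scheduler: the function name `shard_tasks`, the task ID format, and the worker count of 40 are all assumptions made for illustration.

```python
from typing import List


def shard_tasks(task_ids: List[str], num_workers: int) -> List[List[str]]:
    """Split task IDs round-robin into num_workers shards of near-equal size."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    for index, task_id in enumerate(task_ids):
        shards[index % num_workers].append(task_id)
    return shards


# Hypothetical task IDs; the real benchmark defines its own identifiers.
task_ids = [f"task_{i:03d}" for i in range(154)]

# Hypothetical worker count; pick whatever your cloud quota allows.
shards = shard_tasks(task_ids, num_workers=40)

largest = max(len(shard) for shard in shards)
print(f"{len(shards)} workers, at most {largest} tasks per worker")
```

With near-equal shards, wall-clock time is bounded by the slowest worker rather than by the sum of all task durations, which is the reason a full run can finish in minutes instead of hours.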
Key Findings
Zero-shot multimodal baseline (Navi) reaches 19.5% task success on WindowsAgentArena.
Human performance on the same tasks is 74.5% success.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| WindowsAgentArena task success (best Navi) | 19.5% | Human 74.5% | -55.0pp | WindowsAgentArena (154 tasks) | Best Navi config (UIA + Omniparser + GPT-4V-1106) achieves 19.5% success | Table 4 |
| Human task success | 74.5% | — | — | WindowsAgentArena (human participant) | Single human run across tasks reports 74.5% overall success | Table 4; Appendix A.5 |
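The Delta column is a difference in percentage points, not a relative change. The short sketch below recomputes it from the two success rates in the table; the helper names are illustrative only.

```python
def gap_in_percentage_points(agent_rate: float, reference_rate: float) -> float:
    """Absolute difference between two rates given in percent."""
    return round(agent_rate - reference_rate, 1)


def fraction_of_reference(agent_rate: float, reference_rate: float) -> float:
    """Agent rate expressed as a fraction of the reference rate."""
    if reference_rate == 0:
        raise ValueError("reference_rate must be non-zero")
    return agent_rate / reference_rate


navi_success = 19.5   # best Navi configuration, from the table
human_success = 74.5  # single human run, from the table

delta = gap_in_percentage_points(navi_success, human_success)
ratio = fraction_of_reference(navi_success, human_success)

print(f"delta = {delta:+.1f} pp")  # prints "delta = -55.0 pp"
print(f"Navi reaches {ratio:.0%} of the human success rate")
```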
What To Try In 7 Days
Run the repo locally or in Azure to reproduce a subset of tasks and baseline results.
Evaluate your existing agent using UIA + pixel SoM inputs and compare its success rate against Navi's 19.5%.
Collect failed trajectories from 50–100 tasks to prioritize perception fixes (SoM/UIA) or build small fine-tuning datasets.
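One way to turn collected failures into priorities is to label each failed trajectory with a coarse cause and rank the causes by frequency. The record layout and the category names below are assumptions for this sketch; the benchmark's own log format may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical, hand-labeled trajectory records.
trajectories: List[Dict[str, object]] = [
    {"task_id": "t01", "success": False, "failure": "som_bbox"},
    {"task_id": "t02", "success": True, "failure": None},
    {"task_id": "t03", "success": False, "failure": "wrong_visual_id"},
    {"task_id": "t04", "success": False, "failure": "som_bbox"},
    {"task_id": "t05", "success": False, "failure": "planning"},
    {"task_id": "t06", "success": False, "failure": "som_bbox"},
]


def rank_failures(records: List[Dict[str, object]]) -> List[Tuple[str, int]]:
    """Count failure labels over failed runs, most frequent first."""
    counts = Counter(
        str(record["failure"]) for record in records if not record["success"]
    )
    return counts.most_common()


ranking = rank_failures(trajectories)
for cause, count in ranking:
    print(f"{cause}: {count}")
```

The most frequent label indicates where a perception fix or a small fine-tuning set is likely to pay off first.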
Agent Features
Is Agentic
Yes
Risks & Boundaries
Limitations
Top agent results rely on proprietary vision/language models and internal detectors; open-source variants perform worse.
Windows 11 VM snapshot cannot be distributed due to licensing; setup requires following repo scripts and obtaining a trial image.
When Not To Use
If your product targets non-Windows OS environments (use AndroidWorld/OSWorld instead).
When you need certified safety audits for autonomous actions without human oversight.
Failure Modes
Incorrect or imprecise Set-of-Marks (SoM) bounding boxes cause wrong element clicks.
Visual–language misalignment: the model's text describes the correct action, but the model selects the wrong visual element ID.
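Both failure modes trace back to how candidate elements are boxed and numbered. A minimal sketch of one mitigation, assuming boxes are `(left, top, right, bottom)` tuples in pixels: keep every UIA box and add a pixel-detected box only when it does not substantially overlap a UIA box. The function names and the 0.5 overlap threshold are assumptions for illustration, not the method used by Navi.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def area(box: Box) -> float:
    """Area of a box; zero if the box is degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def merge_marks(uia: List[Box], pixel: List[Box], threshold: float = 0.5) -> List[Box]:
    """Prefer UIA boxes; add pixel boxes that overlap no UIA box above threshold."""
    merged = list(uia)
    for candidate in pixel:
        if all(iou(candidate, known) < threshold for known in uia):
            merged.append(candidate)
    return merged


uia_boxes = [(0.0, 0.0, 10.0, 10.0)]
pixel_boxes = [(1.0, 1.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]

marks = merge_marks(uia_boxes, pixel_boxes)
print(marks)  # the near-duplicate pixel box is dropped, the new one is kept
```

Deduplicating this way keeps one ID per element, which reduces the chance that the model names the right control but picks a neighboring mark.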

