Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
WindowsAgentArena lets teams test desktop automation agents in real Windows apps and collect training/evaluation data quickly using cloud parallel runs, shortening iteration time and revealing real gaps between current models and human performance.
Summary TLDR
WindowsAgentArena is an open-source benchmark that runs agents inside a real Windows 11 VM to test multi-step, multimodal desktop tasks. It ships 154 Windows tasks across apps (Office, browsers, File Explorer, VLC, VS Code) and supports fast parallel evaluation on Azure. The paper also releases a multimodal baseline agent, Navi. The best zero-shot Navi configuration reaches 19.5% task success on WindowsAgentArena versus 74.5% for a human; performance improves when precise markers from the UI Automation (UIA) accessibility tree are combined with pixel-based Set-of-Marks. The suite and code can be run locally or at scale for faster iteration and data generation.
Problem Statement
Benchmarks for agents either focus on narrow domains (text-only, web-only, mobile) or run too slowly because realistic multi-step tasks must execute in real operating systems. There is a need for a reproducible, scalable Windows benchmark that exposes agents to real apps, realistic screen content, and fast parallel evaluation to speed research and data generation.
Main Contribution
WindowsAgentArena: 154 reproducible multi-step Windows tasks across common apps and web domains with execution-based evaluators.
A scalable deployment design using Docker + Windows VMs and Azure parallelization that can run full benchmark evaluations in about 20 minutes.
Navi: an open multimodal baseline agent with variants that combine UI Automation tree, OCR, pixel detectors, and Set-of-Marks prompting.
Open-source release: code, tasks, and a baseline agent to support reproducible evaluation and data generation.
Key Findings
Zero-shot multimodal baseline (Navi) reaches 19.5% task success on WindowsAgentArena.
Human performance on the same tasks is 74.5% success.
Adding high-quality UI accessibility markers (UIA) plus pixel-based Set-of-Marks markedly improves agent success.
Multimodal Set-of-Marks input improved Mind2Web step success to 45.2%, beating prior SeeAct results (38.3%).
Parallel Azure deployment reduces wall-clock time for a full evaluation to roughly 20 minutes for most configurations.
Results
WindowsAgentArena task success (best Navi)
19.5%
Human task success
74.5%
Mind2Web step success (best multimodal SoM)
45.2%
End-to-end evaluation time (median)
about 20 minutes
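To put the headline numbers in perspective, the short calculation below works out the gap between the best Navi variant (19.5%) and the human baseline (74.5%) on the 154 tasks. The task counts it derives are back-of-envelope estimates from the percentages, not figures reported in the paper, and the function and variable names are illustrative.

```python
# Figures come from the summary above; the derived task counts are estimates.
NAVI_SUCCESS = 19.5   # best zero-shot Navi task success, in percent
HUMAN_SUCCESS = 74.5  # human task success, in percent
NUM_TASKS = 154


def gap_summary(agent_pct: float, human_pct: float, num_tasks: int) -> dict:
    """Return the absolute gap, the agent/human ratio, and rough task counts."""
    return {
        "gap_points": round(human_pct - agent_pct, 1),
        "agent_to_human_ratio": round(agent_pct / human_pct, 3),
        "approx_agent_tasks_solved": round(num_tasks * agent_pct / 100),
        "approx_human_tasks_solved": round(num_tasks * human_pct / 100),
    }


summary = gap_summary(NAVI_SUCCESS, HUMAN_SUCCESS, NUM_TASKS)
print(summary)
```

Under these numbers the agent solves a bit more than a quarter of what the human solves, a 55-point gap.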
Who Should Care
What To Try In 7 Days
Run the repo locally or in Azure to reproduce a subset of tasks and baseline results.
Evaluate your existing agent using UIA + pixel SoM inputs and compare success vs 19.5%.
Collect failed trajectories from 50–100 tasks to prioritize perception fixes (SoM/UIA) or build small fine-tuning datasets.
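One way to triage the collected trajectories is to tag each failure by hand and count the tags. The sketch below assumes a hypothetical `Trajectory` record and made-up tag names; the benchmark's own log format may look quite different.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record for one task run; not the benchmark's log schema."""
    task_id: str
    success: bool
    failure_tag: str = ""  # manual label, e.g. "wrong_mark" or "planning"


def triage(trajectories):
    """Return (success rate in percent, failure tags sorted by frequency)."""
    total = len(trajectories)
    if total == 0:
        return 0.0, []
    solved = sum(t.success for t in trajectories)
    tags = Counter(
        t.failure_tag or "unlabeled" for t in trajectories if not t.success
    )
    return round(100 * solved / total, 1), tags.most_common()


runs = [
    Trajectory("notepad_01", True),
    Trajectory("vlc_03", False, "wrong_mark"),
    Trajectory("edge_07", False, "wrong_mark"),
    Trajectory("vscode_02", False, "planning"),
    Trajectory("explorer_05", False),
]
rate, ranked = triage(runs)
```

The most frequent tag points at where to spend effort first: perception fixes if marks dominate, prompting or fine-tuning data if planning errors do.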
Agent Features
Memory
- short-term textual memory block
- clipboard as temporary storage
Planning
- chain-of-thought prompting (reasoning in steps)
- explicit stepwise Python action generation
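When the policy emits Python as its action, the harness has to pull the code out of the model's reply before running it. The sketch below is one possible approach, assuming the reply wraps the action in a fenced block; the reply format and the `extract_action_code` helper are assumptions, not Navi's actual parsing logic.

```python
import ast
import re

# Matches a fenced block, optionally tagged "python"; `{3} means three backticks.
FENCE = re.compile(r"`{3}(?:python)?\s*\n(.*?)`{3}", re.DOTALL)


def extract_action_code(response: str):
    """Return the first fenced code block if it parses as Python, else None."""
    match = FENCE.search(response)
    if match is None:
        return None
    code = match.group(1).strip()
    try:
        ast.parse(code)  # syntax check only; nothing is executed here
    except SyntaxError:
        return None
    return code


fence = "`" * 3
reply = (
    "Thought: open the File menu first.\n"
    f"{fence}python\ncomputer.click(10, 20)\n{fence}"
)
action = extract_action_code(reply)
```

Rejecting replies that fail the syntax check lets the harness re-prompt instead of crashing inside the VM.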
Tool Use
- Computer class API (mouse/keyboard/OS functions)
- pyautogui wrapper
- clipboard and window manager controls
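A thin facade over mouse and keyboard calls could look like the sketch below. The `Computer` class here is a hypothetical stand-in, not the class shipped with the benchmark, and `RecordingBackend` replaces pyautogui so the example runs without a display. Passing the real `pyautogui` module as the backend should work under the assumption that only `click(x, y)`, `write(text)`, and `hotkey(*keys)` are used.

```python
class RecordingBackend:
    """Stand-in for pyautogui that records calls instead of moving the mouse."""

    def __init__(self):
        self.calls = []

    def click(self, x, y):
        self.calls.append(("click", x, y))

    def write(self, text):
        self.calls.append(("write", text))

    def hotkey(self, *keys):
        self.calls.append(("hotkey", keys))


class Computer:
    """Hypothetical facade over a pyautogui-like backend."""

    def __init__(self, backend, screen_size):
        self.backend = backend
        self.width, self.height = screen_size

    def click(self, x, y):
        # Refuse clicks outside the screen rather than passing them on.
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError(f"click ({x}, {y}) is outside the screen")
        self.backend.click(x, y)

    def type_text(self, text):
        self.backend.write(text)

    def copy(self):
        # Clipboard as temporary storage: copy the current selection.
        self.backend.hotkey("ctrl", "c")


backend = RecordingBackend()
computer = Computer(backend, screen_size=(1920, 1080))
computer.click(100, 200)
computer.type_text("hello")
computer.copy()
```

Keeping the backend injectable makes agent actions easy to log and replay without touching a real desktop.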
Frameworks
- Set-of-Marks prompting
- UIA accessibility tree parsing
- Omniparser and pixel detectors
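The idea behind Set-of-Marks prompting is to give every detected UI element a numeric ID the model can refer to, then map the chosen ID back to screen coordinates. The sketch below covers only the text side of that loop (in practice the marks are also drawn onto the screenshot); the `Element` type and helper names are hypothetical, and the boxes would come from the UIA tree or a pixel detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical detected UI element."""
    name: str
    box: tuple  # (left, top, right, bottom) in pixels


def assign_marks(elements):
    """Give each element a numeric mark ID, starting at 1."""
    return {i: el for i, el in enumerate(elements, start=1)}


def marks_to_prompt(marks):
    """List the marks as text so the model can answer with an ID."""
    return "\n".join(
        f"[{mark_id}] {el.name} at {el.box}" for mark_id, el in marks.items()
    )


def mark_center(marks, mark_id):
    """Resolve a mark ID chosen by the model to a click point."""
    left, top, right, bottom = marks[mark_id].box
    return ((left + right) // 2, (top + bottom) // 2)


elements = [
    Element("File menu", (0, 0, 40, 20)),
    Element("Save button", (50, 0, 90, 20)),
]
marks = assign_marks(elements)
prompt = marks_to_prompt(marks)
click_point = mark_center(marks, 2)
```

Because the click point is derived from the box, an imprecise box translates directly into a misplaced click.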
Is Agentic
true
Architectures
- LLM + Visual Language Model (VLM) stack
- prompted code-output (Python) action policy
Collaboration
- supports human-in-the-loop decisions (discussed as an option)
Optimization Features
Infra Optimization
- Azure ML job parallelization; pick VMs with nested virtualization
- QEMU/KVM snapshot reuse to reduce setup time
System Optimization
- Dockerized Windows VM for reproducible runs
- use of UIA reduces reliance on expensive pixel search
Training Optimization
- RL
Inference Optimization
- parallelize independent tasks across cloud workers
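Because benchmark tasks are independent, they can be split into shards and run concurrently. The sketch below uses local threads as a stand-in for cloud workers; the real deployment described above uses Azure ML jobs with one Windows VM per worker, and `run_task` here is a hypothetical callable that returns whether a task succeeded.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(task_ids, num_workers):
    """Round-robin split of task IDs into num_workers shards."""
    return [task_ids[i::num_workers] for i in range(num_workers)]


def run_shard(task_ids, run_task):
    """Run one worker's tasks sequentially and collect the outcomes."""
    return {task_id: run_task(task_id) for task_id in task_ids}


def run_parallel(task_ids, run_task, num_workers):
    """Run all shards concurrently and merge the per-shard results."""
    results = {}
    shards = shard(task_ids, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(lambda s: run_shard(s, run_task), shards):
            results.update(partial)
    return results


task_ids = [f"task_{i}" for i in range(10)]
# Toy runner: pretend only task_0 succeeds.
results = run_parallel(task_ids, lambda tid: tid == "task_0", num_workers=4)
```

With even shards, wall-clock time is bounded by the slowest shard rather than by the sum of all task times.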
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Top agent results rely on proprietary vision/language models and internal detectors; open-source variants perform worse.
- Windows 11 VM snapshot cannot be distributed due to licensing; setup requires following repo scripts and obtaining a trial image.
- Single human evaluator for the human baseline; broader human studies are not provided.
- Evaluation focuses on Windows desktop workflows; does not cover mobile or server-only tasks.
When Not To Use
- If your product targets non-Windows OS environments (use AndroidWorld/OSWorld instead).
- When you need certified safety audits for autonomous actions without human oversight.
- If you require end-to-end real-time latency guarantees for production deployments.
Failure Modes
- Incorrect or imprecise Set-of-Marks (SoM) bounding boxes cause wrong element clicks.
- Visual–language misalignment: model text describes correct action but selects wrong visual ID.
- Long-context planning hallucinations in smaller LLMs that lead to invalid action sequences.
- Slow or CPU-bound proprietary visual parsers (e.g., Omniparser) extend runtime and reduce throughput.
Core Entities
Models
- GPT-4V-1106
- GPT-4o
- GPT-4o-mini
- Phi3-V
- Phi3
Metrics
- Task success rate
- Accuracy
- Operation F1
- Step success rate
- Evaluation time (median run time)
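The rate-style metrics above reduce to simple ratios. The sketch below assumes the usual Mind2Web-style definitions (a step counts as successful only when both the selected element and the predicted operation are correct, and Operation F1 is a token-level F1 over the operation string); treat these as assumptions and check them against the evaluation code before comparing numbers.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference operation string."""
    pred, ref = predicted.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def step_success_rate(steps) -> float:
    """steps: list of (element_correct, operation_correct) booleans."""
    if not steps:
        return 0.0
    return sum(e and o for e, o in steps) / len(steps)


def task_success_rate(outcomes) -> float:
    """outcomes: one boolean per task, True if the final state was correct."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


f1 = token_f1("TYPE hello world", "TYPE hello")
step_rate = step_success_rate(
    [(True, True), (True, False), (False, True), (True, True)]
)
task_rate = task_success_rate([True, False, False, False])
```

Step success is stricter than either per-step component alone, which is why it sits below element accuracy and operation F1 in practice.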
Datasets
- WindowsAgentArena (154 tasks)
- Mind2Web (processed)
Benchmarks
- WindowsAgentArena
- OSWorld
- Mind2Web
- AndroidWorld
Context Entities
Models
- SeeAct
Metrics
- Human success baseline
Benchmarks
- MiniWoB++
- WebArena
- VisualWebArena
- WorkArena
- MMInA

