A reproducible Windows benchmark and baseline agent showing that zero-shot multimodal agents remain far from human performance

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible. The baseline agent shows where effort should go (visual parsing, UIA integration), but the zero-shot results are preliminary and the best performance relies on proprietary models.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WindowsAgentArena lets teams test desktop automation agents in real Windows apps and collect training/evaluation data quickly using cloud parallel runs, shortening iteration time and revealing real gaps between current models and human performance.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

WindowsAgentArena is an open-source benchmark that runs agents inside a real Windows 11 VM to test multi-step, multimodal desktop tasks. It ships 154 Windows tasks across apps (Office, browsers, file explorer, VLC, VSCode) and supports fast parallel evaluation on Azure. The paper also releases a multimodal baseline agent, Navi. Best zero-shot Navi reaches 19.5% task success on WindowsAgentArena versus 74.5% for a human; performance improves when precise UI accessibility markers (UIA) are combined with pixel-based Set-of-Marks. The suite and code are available to run locally or at scale for faster iteration and data generation.

Problem Statement

Benchmarks for agents either focus on narrow domains (text-only, web-only, mobile) or run too slowly because realistic multi-step tasks must execute in real operating systems. There is a need for a reproducible, scalable Windows benchmark that exposes agents to real apps, realistic screen content, and fast parallel evaluation to speed research and data generation.

Main Contribution

WindowsAgentArena: 154 reproducible multi-step Windows tasks across common apps and web domains with execution-based evaluators.

A scalable deployment design using Docker + Windows VMs and Azure parallelization that can run full benchmark evaluations in about 20 minutes.
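The parallel evaluation idea can be illustrated with a minimal sketch that splits the 154 tasks across cloud workers. The task IDs, the worker count of 40, and the `shard_tasks` helper are all assumptions made for illustration; the repository's own scheduling logic may differ.

```python
from typing import List


def shard_tasks(task_ids: List[str], num_workers: int) -> List[List[str]]:
    """Split task IDs round-robin so each worker gets a near-equal share."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    for index, task_id in enumerate(task_ids):
        shards[index % num_workers].append(task_id)
    return shards


# Hypothetical task IDs; the real benchmark defines its own identifiers.
tasks = [f"task_{i:03d}" for i in range(154)]
# The worker count is an illustrative assumption, not a documented setting.
shards = shard_tasks(tasks, num_workers=40)
print(max(len(s) for s in shards), min(len(s) for s in shards))
```

With independent tasks, wall-clock time is bounded by the slowest shard, so near-equal shard sizes matter more than the exact assignment order.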

Key Findings

Zero-shot multimodal baseline (Navi) reaches 19.5% task success on WindowsAgentArena.

Numbers: 19.5% success (Table 4, best config)

Practical Use: Expect current generalist VLM agents to solve only a minority of realistic desktop tasks; plan for fine-tuning, specialized modules, or human-in-the-loop workflows.

Evidence Ref: Table 4

Human performance on the same tasks is 74.5% success.

Numbers: 74.5% human success (Table 4 & A.5)

Practical Use: Benchmarks should target human-level workflows; use WindowsAgentArena to measure the remaining gap and prioritize improvements in perception and action accuracy.

Evidence Ref: Table 4; Appendix A.5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WindowsAgentArena task success (best Navi)	19.5%	Human 74.5%	-55.0pp	WindowsAgentArena (154 tasks)	Best Navi config (UIA + Omniparser + GPT-4V-1106) achieves 19.5% success	Table 4
Human task success	74.5%	—	—	WindowsAgentArena (human participant)	Single human run across tasks reports 74.5% overall success	Table 4; Appendix A.5
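The Delta column is a percentage-point difference between the agent and the human baseline. A short sketch of that arithmetic, using only the two figures reported in the table (the variable names are illustrative):

```python
# Figures taken from the Results table above.
agent_success_pct = 19.5
human_success_pct = 74.5

# Percentage-point gap: agent minus human baseline.
gap_pp = agent_success_pct - human_success_pct

# Agent success expressed as a fraction of the human success rate.
relative_to_human = agent_success_pct / human_success_pct

print(f"gap: {gap_pp:+.1f}pp")
print(f"agent reaches {relative_to_human:.0%} of the human success rate")
```

Reporting the gap in percentage points rather than as a relative change avoids ambiguity when the baseline itself is a percentage.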

What To Try In 7 Days

Run the repo locally or in Azure to reproduce a subset of tasks and baseline results.

Evaluate your existing agent using UIA + pixel SoM inputs and compare success vs 19.5%.

Collect failed trajectories from 50–100 tasks to prioritize perception fixes (SoM/UIA) or build small fine-tuning datasets.
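The failure-collection step can be sketched as a simple tally over trajectory records. The record schema (`success`, `failure_tag`) and the tag names are hypothetical; they are not a format defined by the benchmark, and tags would have to come from manual review or your own tooling.

```python
from collections import Counter
from typing import Dict, List


def tally_failures(trajectories: List[Dict[str, object]]) -> Counter:
    """Count failed trajectories per failure tag; untagged failures are grouped."""
    counts: Counter = Counter()
    for record in trajectories:
        if not record["success"]:
            counts[str(record.get("failure_tag", "untagged"))] += 1
    return counts


# Hypothetical sample records for illustration only.
sample = [
    {"task": "task_001", "success": False, "failure_tag": "wrong_mark_id"},
    {"task": "task_002", "success": True},
    {"task": "task_003", "success": False, "failure_tag": "bad_bounding_box"},
    {"task": "task_004", "success": False, "failure_tag": "wrong_mark_id"},
    {"task": "task_005", "success": False},
]

for tag, count in tally_failures(sample).most_common():
    print(tag, count)
```

Sorting by frequency gives a rough priority order for which perception or grounding issue to address first.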

Agent Features

Memory

short-term textual memory block

clipboard as temporary storage

Planning

chain-of-thought prompting (reasoning in steps)

explicit stepwise Python action generation

Tool Use

Computer class API (mouse/keyboard/OS functions)

pyautogui wrapper

clipboard and window manager controls

Frameworks

Set-of-Marks prompting

UIA accessibility tree parsing

Omniparser and pixel detectors
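One way to picture combining UIA elements with pixel-detected elements into a single Set-of-Marks is to merge both box lists, drop near-duplicates, and number what remains. The sketch below is an assumption-laden illustration, not Navi's actual procedure: the IoU-based deduplication rule, the 0.5 threshold, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge_marks(uia_boxes: List[Box], pixel_boxes: List[Box],
                threshold: float = 0.5) -> Dict[int, Box]:
    """Keep all UIA boxes, add pixel boxes that do not overlap a kept box,
    then assign numeric mark IDs starting at 1."""
    merged: List[Box] = list(uia_boxes)
    for box in pixel_boxes:
        if all(iou(box, kept) < threshold for kept in merged):
            merged.append(box)
    return {mark_id: box for mark_id, box in enumerate(merged, start=1)}


# Hypothetical boxes: the first pixel box duplicates the UIA box.
uia = [(0, 0, 100, 40)]
pixel = [(2, 1, 100, 40), (200, 200, 240, 240)]
marks = merge_marks(uia, pixel)
print(marks)
```

Preferring the UIA box when two sources overlap reflects the summary's point that accessibility markers are the more precise signal; a different priority rule would be equally easy to swap in.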

Is Agentic

Yes

Architectures

LLM + Visual Language Model (VLM) stack

prompted code-output (Python) action policy

Collaboration

supports human-in-the-loop decision (discussed as option)

Optimization Features

Infra Optimization

Azure ML job parallelization; pick VMs with nested virtualization

QEMU/KVM snapshot reuse to reduce setup time

System Optimization

Dockerized Windows VM for reproducible runs

use of UIA reduces reliance on expensive pixel search

Training Optimization

Inference Optimization

parallelize independent tasks across cloud workers

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/microsoft/WindowsAgentArena

https://microsoft.github.io/WindowsAgentArena

Data URLs

https://github.com/microsoft/WindowsAgentArena

Risks & Boundaries

Limitations

Top agent results rely on proprietary vision/language models and internal detectors; open-source variants perform worse.

Windows 11 VM snapshot cannot be distributed due to licensing; setup requires following repo scripts and obtaining a trial image.

When Not To Use

If your product targets non-Windows OS environments (use AndroidWorld/OSWorld instead).

When you need certified safety audits for autonomous actions without human oversight.

Failure Modes

Incorrect or imprecise Set-of-Marks (SoM) bounding boxes cause wrong element clicks.

Visual–language misalignment: model text describes correct action but selects wrong visual ID.
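The visual-language misalignment failure suggests a cheap guard: before executing a click, check that the label of the chosen mark is actually mentioned in the model's own reasoning text. The sketch below is a hypothetical heuristic, not something the paper describes, and plain substring matching will miss paraphrases.

```python
def mark_matches_description(description: str, mark_label: str) -> bool:
    """Return True if the mark's label appears (case-insensitively)
    in the model's textual description of the intended action."""
    label = mark_label.strip().lower()
    if not label:
        # Unlabeled marks cannot be verified this way.
        return False
    return label in description.lower()


# Hypothetical model output and mark labels for illustration.
reasoning = "Click the Save button in the toolbar."
print(mark_matches_description(reasoning, "Save"))
print(mark_matches_description(reasoning, "Cancel"))
```

A mismatch could trigger a re-prompt or route the step to a human reviewer instead of executing the click.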

Core Entities

Models

GPT-4V-1106, GPT-4o, GPT-4o-mini, Phi3-V, Phi3

Metrics

Task success rate, Accuracy, Operation F1, Step success rate, Evaluation time (median run time)

Datasets

WindowsAgentArena (154 tasks), Mind2Web (processed)

Benchmarks

WindowsAgentArena, OSWorld, Mind2Web, AndroidWorld

Context Entities

Models

SeeAct

Metrics

Human success baseline

Benchmarks

MiniWoB++, WebArena, VisualWebArena, WorkArena, MMInA

