Agent-E: hierarchical web agent with DOM denoising and change-observation — 73.2% on WebVoyager

July 17, 2024 · 7 min

Overview

Production Readiness

0.45

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku

Why It Matters For Business

A hierarchical agent with DOM denoising and action feedback raises generic web automation success to 73.2% on WebVoyager. It also surfaces self-aware failures, an actionable signal that supports safe fallbacks and learning pipelines.

Summary TLDR

Agent-E is a two-tier web automation system: a planner LLM breaks tasks into subtasks, and a browser navigation LLM executes them using a small set of primitive skills. Its key ideas are flexible DOM distillation (three observation modes), DOM denoising with mmid element IDs, and 'change observation', which returns linguistic feedback after each action. Evaluated on the WebVoyager benchmark (643 tasks), Agent-E achieves 73.2% success, a gain of roughly 20 percentage points over prior text-only agents. The paper also reports operational metrics (an average of 25 LLM calls per task; successful tasks take about 150s) and extracts eight practical design principles for building agentic systems.
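
The two-tier split can be sketched as a small control loop. This is a minimal illustration, not the paper's implementation: `run_task`, `stub_plan`, and `stub_execute` are hypothetical names, and the two LLM agents are replaced with deterministic stubs so that the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubtaskResult:
    subtask: str
    ok: bool
    feedback: str  # linguistic feedback handed back to the planner


@dataclass
class RunLog:
    results: List[SubtaskResult] = field(default_factory=list)
    completed: bool = False


def run_task(
    task: str,
    plan: Callable[[str], List[str]],
    execute_subtask: Callable[[str], SubtaskResult],
) -> RunLog:
    """The planner decomposes the task; the executor handles one subtask at a time."""
    log = RunLog()
    for subtask in plan(task):
        result = execute_subtask(subtask)
        log.results.append(result)
        if not result.ok:
            # Stop and surface the failure rather than guessing an answer.
            return log
    log.completed = True
    return log


# Hypothetical stubs standing in for the planner and browser navigation LLMs.
def stub_plan(task: str) -> List[str]:
    return [f"open site for: {task}", "enter query", "read result"]


def stub_execute(subtask: str) -> SubtaskResult:
    return SubtaskResult(subtask, ok=True, feedback=f"done: {subtask}")


log = run_task("find the cheapest flight", stub_plan, stub_execute)
print(log.completed, len(log.results))
```

Because the planner only sees subtask-level feedback, it stays insulated from raw page content; a real planner would also re-plan on failure instead of simply stopping.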

Problem Statement

Web agents must act on very noisy, large, and dynamic web pages while fitting inputs into LLM context windows. They struggle with complex widgets, multi-step planning, and dynamic state changes. Success rate alone hides issues like long runtimes, high LLM costs, and silent (oblivious) failures.

Main Contribution

A hierarchical planner + browser-navigation agent architecture that separates planning from low-level actions.

A flexible DOM distillation approach offering three DOM observation modes (text-only, input-fields, all-fields).

A 'change observation' mechanism that reports state changes after actions as linguistic feedback.

A working system (Agent-E) that achieves 73.2% on WebVoyager and outperforms prior text and multi-modal agents on most sites.

Reporting operational metrics beyond success rate: error-awareness, task completion time, and LLM call counts.

A set of eight practical design principles for agentic systems (e.g., primitive skills, denoising, hierarchical design, human-in-loop).
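
The three observation modes can be illustrated with a toy filter over a page snapshot. This is a sketch under stated assumptions: the element dictionaries, the tag sets, the kept attributes, and the `distill` function are all hypothetical, and the paper's actual filtering rules may differ. Only the mode names and the idea of tagging elements with an `mmid` come from the summary above.

```python
from typing import Dict, List

# Toy page snapshot; a real system would read this from the browser.
PAGE = [
    {"tag": "div", "text": "Welcome", "style": "color:red"},
    {"tag": "input", "name": "q", "text": "", "style": "margin:0"},
    {"tag": "button", "text": "Search"},
    {"tag": "script", "text": "track()"},
]

# Assumed tag groupings, chosen for illustration only.
INPUT_TAGS = {"input", "textarea", "select"}
INTERACTIVE_TAGS = INPUT_TAGS | {"button", "a"}
NOISE_TAGS = {"script", "style"}
KEEP_ATTRS = ("tag", "name", "text")  # everything else is treated as noise


def distill(elements: List[Dict[str, str]], mode: str) -> List[Dict[str, str]]:
    """Return a reduced view of the page for the requested observation mode."""
    if mode not in {"text_only", "input_fields", "all_fields"}:
        raise ValueError(f"unknown mode: {mode}")
    out = []
    # IDs are assigned by position before filtering, so they match across modes.
    for mmid, el in enumerate(elements, start=1):
        tag = el["tag"]
        if tag in NOISE_TAGS:
            continue
        if mode == "text_only":
            if el.get("text"):
                out.append({"text": el["text"]})
            continue
        wanted = INPUT_TAGS if mode == "input_fields" else INTERACTIVE_TAGS
        if tag in wanted:
            kept = {k: el[k] for k in KEEP_ATTRS if el.get(k)}
            kept["mmid"] = str(mmid)
            out.append(kept)
    return out


for mode in ("text_only", "input_fields", "all_fields"):
    print(mode, distill(PAGE, mode))
```

The point of offering several modes is that the navigation agent can pick the cheapest view that still contains what the current subtask needs, instead of always paying for the full DOM.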

Key Findings

Agent-E reached 73.2% task success on the WebVoyager benchmark.

Numbers: 73.2% overall success (WebVoyager)

Agent-E improves absolute success by ~20 percentage points over prior text-only agents and ~16 percentage points over prior multi-modal agents on the evaluated benchmark.

Numbers: +20 points vs text-only; +16 points vs multi-modal (WebVoyager)

Agent-E used 25 LLM calls per task on average and took ~150s for successful tasks (~220s for failed tasks).

Numbers: avg 25 LLM calls; task completion time (TCT) ≈150s on success, ≈220s on failure

Over half of failures were self-aware (agent reports inability) rather than oblivious (wrong answer).

Numbers: ~52% of failed tasks were self-aware

Performance varies widely by site: e.g., WolframAlpha ~95.7% vs Booking.com ~27.3%.

Numbers: WolframAlpha 95.7%, Booking 27.3% (per-site)

Results

Task success rate (overall)

Value: 73.2%

Baseline: prior text-only ~52.6% (Wilbur)

Per-site extremes

Value: WolframAlpha 95.7% | Booking 27.3%

Average LLM calls per task

Value: 25 calls (planner ≈6.4, browser ≈18.6)

Baseline: not directly comparable (prior work reports partial components)

Average task completion time

Value: success ≈150s; failed ≈220s

Failure self-awareness

Value: ~52% of failed tasks were self-aware
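
Metrics of this kind can be computed from per-task run records. The sketch below assumes a hypothetical `Run` record and uses made-up toy data, so the printed values are not the paper's results; it only shows how success rate, LLM calls, completion time, and the self-aware share of failures relate to each other.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Run:
    success: bool
    self_aware: bool  # only meaningful when success is False
    planner_calls: int
    browser_calls: int
    seconds: float


def summarize(runs: List[Run]) -> dict:
    """Aggregate success rate plus the operational metrics around it."""
    if not runs:
        raise ValueError("no runs to summarize")
    succeeded = [r for r in runs if r.success]
    failed = [r for r in runs if not r.success]
    return {
        "success_rate": len(succeeded) / len(runs),
        "avg_llm_calls": mean(r.planner_calls + r.browser_calls for r in runs),
        # Share of failures where the agent reported that it could not finish.
        "self_aware_share": (
            sum(r.self_aware for r in failed) / len(failed) if failed else 0.0
        ),
        "avg_seconds_success": mean(r.seconds for r in succeeded) if succeeded else None,
        "avg_seconds_failed": mean(r.seconds for r in failed) if failed else None,
    }


# Toy records for illustration only.
runs = [
    Run(True, False, 6, 18, 140.0),
    Run(True, False, 7, 19, 160.0),
    Run(True, False, 6, 18, 150.0),
    Run(False, True, 8, 22, 220.0),
]
summary = summarize(runs)
print(summary)
```

Tracking planner and browser calls separately makes it possible to see where the cost of a task goes, which a single success rate hides.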

Who Should Care

What To Try In 7 Days

Implement a small planner + executor split and run a few web tasks to compare error modes.

Add at least two DOM extraction modes (text vs input-fields) and measure changes.

Return a short textual 'change observation' after every click to improve grounding.
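
The third step can be prototyped without a browser by diffing two page snapshots. This is a minimal sketch, assuming each snapshot is a dictionary from element ID to a short description; `describe_change` and the snapshot format are hypothetical and are not the mechanism used in the paper.

```python
from typing import Dict


def describe_change(action: str, before: Dict[str, str], after: Dict[str, str]) -> str:
    """Turn the difference between two snapshots into one sentence of feedback."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(k for k in set(before) & set(after) if before[k] != after[k])
    if not (added or removed or changed):
        return f"Action '{action}' executed, but no change was observed on the page."
    parts = []
    if added:
        parts.append("new elements appeared: " + ", ".join(after[k] for k in added))
    if removed:
        parts.append("elements disappeared: " + ", ".join(before[k] for k in removed))
    if changed:
        parts.append(
            "elements changed: " + ", ".join(f"{before[k]} -> {after[k]}" for k in changed)
        )
    return f"Action '{action}' executed; " + "; ".join(parts) + "."


before = {"12": "button 'Departure date'"}
after = {"12": "button 'Departure date'", "40": "date picker popup"}
feedback = describe_change("click mmid=12", before, after)
print(feedback)
# Action 'click mmid=12' executed; new elements appeared: date picker popup.
```

Returning a sentence rather than a raw diff keeps the feedback short enough to append to the agent's context after every action, and the explicit "no change" message gives the agent a cue to try a different approach.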

Agent Features

Memory

  • short-term per-run (no long-term retrieval described)

Planning

  • task decomposition
  • verification and backtracking

Tool Use

  • primitive skills via function-calling
  • DOM observation selection

Frameworks

  • Autogen
  • Playwright

Is Agentic

true

Architectures

  • hierarchical

Collaboration

  • human-in-the-loop fallback

Optimization Features

Token Efficiency

  • DOM denoising to reduce context tokens

System Optimization

  • hierarchical split to reduce planner exposure to noisy DOM

Inference Optimization

  • LLM call caching suggested (discussed, not implemented)
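
Since caching is only discussed and not implemented, the following is one possible shape for it rather than anything from the paper. `CachedLLM` and `fake_llm` are hypothetical; the cache key is a hash of the full prompt, so a hit only occurs when the system message and the page-dependent prompt are identical.

```python
import hashlib
import json
from typing import Callable, Dict


class CachedLLM:
    """Wrap an LLM call so that identical prompts are answered from memory."""

    def __init__(self, call_llm: Callable[[str, str], str]):
        self._call_llm = call_llm
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, system: str, prompt: str) -> str:
        payload = json.dumps({"system": system, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, system: str, prompt: str) -> str:
        key = self._key(system, prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        response = self._call_llm(system, prompt)
        self._cache[key] = response
        return response


# Hypothetical stand-in for a real model call.
def fake_llm(system: str, prompt: str) -> str:
    return prompt.upper()


llm = CachedLLM(fake_llm)
llm.complete("You are a browser agent.", "click search")
llm.complete("You are a browser agent.", "click search")
print(llm.hits, llm.misses)
```

Exact-match caching like this mainly helps repeated, site-specific workflows; on dynamic pages the prompt changes with the DOM, so hit rates may be low.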

Reproducibility

Data Urls

  • WebVoyager benchmark (He et al., 2024)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance varies strongly by website; some sites (Booking, Flights) remain low.
  • Text-only design misses visual cues; no vision-based DOM observation used.
  • Primitive skills set is intentionally small; lacks drag, right-click, tab management.
  • Evaluation ran in a single region and time window (India, IST), which affects timing comparisons.

When Not To Use

  • Tasks requiring image or visual understanding (screenshots or OCR).
  • Site-specific heavy production workflows without caching or offline optimizations.
  • Real-time, low-latency settings where minutes per task are unacceptable.

Failure Modes

  • Oblivious failures: partial or wrong answers when the agent overlooks task constraints.
  • Technical failures: inability to interact with iframes, canvas, or anti-scraping protections.
  • State desync: dynamic widgets or date formats reset, leading to wrong actions.
  • High-cost failures: long retries causing many LLM calls before giving up.

Core Entities

Models

  • GPT-4-Turbo

Metrics

  • task success rate
  • task completion time
  • LLM calls per task
  • self-aware vs oblivious failures

Datasets

  • WebVoyager

Benchmarks

  • WebVoyager

Context Entities

Models

  • Prior agents: Wilbur, WebVoyager multi-modal