Overview
Production Readiness
0.45
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
A hierarchical agent with DOM denoising and action feedback raises generic web automation success to ~73% and gives actionable signals (self-aware failures) that support safe fallbacks and learning pipelines.
Summary TLDR
Agent-E is a two-tier web automation system: a planner LLM breaks tasks into subtasks, and a browser navigation LLM executes them using a small set of primitive skills. Key ideas are flexible DOM distillation (three observation modes), DOM denoising with mmid element IDs, and 'change observation', which returns linguistic feedback after each action. Evaluated on the WebVoyager benchmark (643 tasks), Agent-E achieves 73.2% success, a ~20% absolute gain over prior text-only agents. The paper also reports operational metrics (an average of 25 LLM calls per task; ~150s for successful tasks) and extracts eight practical design principles for building agentic systems.
Problem Statement
Web agents must act on very noisy, large, and dynamic web pages while fitting inputs into LLM context windows. They struggle with complex widgets, multi-step planning, and dynamic state changes. Success rate alone hides issues like long runtimes, high LLM costs, and silent (oblivious) failures.
Main Contribution
A hierarchical planner + browser-navigation agent architecture that separates planning from low-level actions.
A flexible DOM distillation approach offering three DOM observation modes (text-only, input-fields, all-fields).
A 'change observation' mechanism that reports state changes after actions as linguistic feedback.
A working system (Agent-E) that achieves 73.2% on WebVoyager and outperforms prior text and multi-modal agents on most sites.
Operational metrics reported beyond success rate: error-awareness, task completion time, and LLM call counts.
A set of eight practical design principles for agentic systems (e.g., primitive skills, denoising, hierarchical design, human-in-loop).
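The planner/executor split listed above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not Agent-E's actual code: `run_task`, `StepResult`, `scripted_planner`, and `fake_executor` are hypothetical names, and both the planner and the executor are plain stand-in functions where a real system would make LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    subtask: str
    ok: bool
    feedback: str  # linguistic feedback handed back to the planner


# Hypothetical signatures: in a real system both would wrap LLM calls.
Planner = Callable[[str, List[StepResult]], Optional[str]]
Executor = Callable[[str], StepResult]


def run_task(task: str, plan: Planner, execute: Executor,
             max_steps: int = 10) -> List[StepResult]:
    """Alternate between planning and execution until the planner stops."""
    history: List[StepResult] = []
    for _ in range(max_steps):
        subtask = plan(task, history)  # planner sees results, never raw DOM
        if subtask is None:
            break
        history.append(execute(subtask))
    return history


def scripted_planner(task: str, history: List[StepResult]) -> Optional[str]:
    """Stand-in planner: walks a fixed list and retries a failed step once."""
    steps = ["open search page", "type query", "read first result"]
    done = [r.subtask for r in history if r.ok]
    for step in steps:
        if step not in done:
            attempts = sum(1 for r in history if r.subtask == step)
            return step if attempts < 2 else None  # give up after two tries
    return None


def fake_executor(subtask: str) -> StepResult:
    """Stand-in executor: pretends every subtask succeeds."""
    return StepResult(subtask, True, f"done: {subtask}")


history = run_task("find the answer", scripted_planner, fake_executor)
```

Keeping the planner's input limited to step results is one way to reflect the idea that the planner should not be exposed to noisy page content; the retry limit here is an arbitrary choice for the example.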
Key Findings
Agent-E reached 73.2% task success on the WebVoyager benchmark.
Agent-E improves absolute success by ~20% over prior text-only agents and ~16% over prior multi-modal agents on the evaluated benchmark.
Agent-E used 25 LLM calls per task on average and took ~150s on successful tasks (220s on failed tasks).
Over half of failures were self-aware (agent reports inability) rather than oblivious (wrong answer).
Performance varies widely by site: e.g., WolframAlpha ~95.7% vs Booking.com ~27.3%.
Results
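Task success rate (overall): 73.2%
Per-site extremes: WolframAlpha ~95.7%, Booking.com ~27.3%
Average LLM calls per task: 25
Average task completion time: ~150s for successful tasks, 220s for failed tasks
Failure self-awareness: over half of failures were self-aware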
Who Should Care
What To Try In 7 Days
Implement a small planner + executor split and run a few web tasks to compare error modes.
Add at least two DOM extraction modes (text vs input-fields) and measure changes.
Return a short textual 'change observation' after every click to improve grounding.
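The 'change observation' step above can be illustrated with a snapshot diff. This is a minimal sketch and not Agent-E's implementation: it assumes the page state has already been reduced to a mapping from element ID to visible text, and `describe_change` is a hypothetical helper name.

```python
from typing import Dict, List


def describe_change(before: Dict[str, str], after: Dict[str, str]) -> str:
    """Summarize in plain language what changed between two page snapshots."""
    added = sorted(k for k in after if k not in before)
    removed = sorted(k for k in before if k not in after)
    changed = sorted(k for k in after if k in before and before[k] != after[k])

    parts: List[str] = []
    if added:
        parts.append("new elements appeared: " + ", ".join(added))
    if removed:
        parts.append("elements disappeared: " + ", ".join(removed))
    if changed:
        parts.append("text changed in: " + ", ".join(changed))
    if not parts:
        return "Action result: no visible change on the page."
    return "Action result: " + "; ".join(parts) + "."


# Example snapshots (made-up data): a suggestion box appears after typing.
before = {"12": "Search", "13": ""}
after = {"12": "Search", "13": "", "14": "Suggestion: flights to Pune"}
feedback = describe_change(before, after)
```

The returned sentence would be appended to the executor's next prompt, so the model learns that its click or keystroke opened a new element instead of assuming the page is unchanged.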
Agent Features
Memory
- short-term per-run (no long-term retrieval described)
Planning
- task decomposition
- verification and backtracking
Tool Use
- primitive skills via function-calling
- DOM observation selection
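One way to picture DOM observation selection is a single function that returns a different view of the page depending on the requested mode. The mode names follow the three modes named in this summary (text-only, input-fields, all-fields); what each mode contains, the `INPUT_TAGS` set, and the element shape are assumptions made for this sketch.

```python
from enum import Enum
from typing import Dict, List, Union

Element = Dict[str, str]


class ObservationMode(Enum):
    TEXT_ONLY = "text_only"
    INPUT_FIELDS = "input_fields"
    ALL_FIELDS = "all_fields"


# Assumed definition of an "input field" for this example.
INPUT_TAGS = {"input", "select", "textarea", "button"}


def observe(elements: List[Element],
            mode: ObservationMode) -> Union[str, List[Element]]:
    """Return the page view that matches the requested observation mode."""
    if mode is ObservationMode.TEXT_ONLY:
        # Cheapest view: only the visible text, no structure.
        return " ".join(e["text"] for e in elements if e["text"])
    if mode is ObservationMode.INPUT_FIELDS:
        # Only the elements a user could type into or press.
        return [e for e in elements if e["tag"] in INPUT_TAGS]
    # Largest view: every element that survived distillation.
    return list(elements)


# Made-up page content for illustration.
page = [
    {"tag": "h1", "text": "Flight search"},
    {"tag": "input", "text": ""},
    {"tag": "a", "text": "Help"},
    {"tag": "button", "text": "Search"},
]
summary_text = observe(page, ObservationMode.TEXT_ONLY)
fields = observe(page, ObservationMode.INPUT_FIELDS)
```

Letting the executor pick the mode per step means a read-only subtask can use the small text view, while a form-filling subtask pays for the larger structured view only when it needs it.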
Frameworks
- Autogen
- Playwright
Is Agentic
true
Architectures
- hierarchical
Collaboration
- human-in-the-loop fallback
Optimization Features
Token Efficiency
- DOM denoising to reduce context tokens
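DOM denoising can be sketched with Python's standard `html.parser`. This is an illustration of the general idea only: the tag and attribute lists are assumptions, the `Denoiser` class and `denoise` function are hypothetical names, and the `mmid` here is just a running number assigned during parsing, which may differ from how Agent-E assigns its IDs.

```python
from html.parser import HTMLParser
from typing import Dict, List

SKIP_TAGS = {"script", "style", "noscript"}   # content never shown to the LLM
KEEP_TAGS = {"a", "button", "input", "select", "textarea"}
VOID_TAGS = {"input"}                         # kept tags with no closing tag
KEEP_ATTRS = {"href", "name", "type", "placeholder", "aria-label"}


class Denoiser(HTMLParser):
    """Collect interactive elements and drop everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Dict] = []
        self._skip_depth = 0
        self._open: List[Dict] = []  # kept elements still waiting for text

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in KEEP_TAGS:
            return
        element = {
            "mmid": len(self.elements) + 1,  # assumed ID scheme
            "tag": tag,
            "attrs": {k: v for k, v in attrs if k in KEEP_ATTRS},
            "text": "",
        }
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._skip_depth == 0 and self._open:
            self._open[-1]["text"] += data.strip()


def denoise(html: str) -> List[Dict]:
    parser = Denoiser()
    parser.feed(html)
    parser.close()
    return parser.elements


PAGE = """<html><head><style>.a{color:red}</style><script>var x = 1;</script></head>
<body><div class="wrapper"><a href="/home">Home</a>
<input type="text" name="q" placeholder="Search" class="big">
<button>Go</button></div></body></html>"""
elements = denoise(PAGE)
```

On this sample page the style rules, script body, wrapper `div`, and `class` attribute are all dropped, leaving three short records that the executor can refer to by `mmid`.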
System Optimization
- hierarchical split to reduce planner exposure to noisy DOM
Inference Optimization
- LLM call caching suggested (discussed, not implemented)
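Since caching is only discussed and not implemented, the following is purely a sketch of what an exact-match cache could look like, not a description of the paper's design. `CachedLLM` is a hypothetical wrapper, and `fake_llm` stands in for a real model call.

```python
import hashlib
from typing import Callable, Dict


class CachedLLM:
    """Wrap a prompt -> completion function with an exact-match cache."""

    def __init__(self, call_llm: Callable[[str], str]) -> None:
        self._call = call_llm
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, prompt: str) -> str:
        # Hash the prompt so cache keys stay small even for long DOM dumps.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._call(prompt)
        self._cache[key] = result
        return result


calls_made = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; records each invocation."""
    calls_made.append(prompt)
    return f"plan for: {prompt}"


llm = CachedLLM(fake_llm)
first = llm("book a table for two")
second = llm("book a table for two")  # served from the cache
```

An exact-match cache only helps when the same prompt recurs, for example when a site-specific workflow is replayed; prompts that embed changing page content would rarely hit it without further normalization.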
Reproducibility
Data Urls
- WebVoyager benchmark (He et al., 2024)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance varies strongly by website; some sites (Booking, Flights) remain low.
- Text-only design misses visual cues; no vision-based DOM observation used.
- The primitive skill set is intentionally small; it lacks drag, right-click, and tab management.
- Evaluation ran in a single region and time window (India, IST), which affects timing comparisons.
When Not To Use
- Tasks requiring image or visual understanding (screenshots or OCR).
- Heavy, site-specific production workflows that lack caching or offline optimizations.
- Real-time, low-latency requirements where minutes per task are unacceptable.
Failure Modes
- Oblivious failures: partial or wrong answers when the agent overlooks task constraints.
- Technical failures: inability to interact with iframes, canvas, or anti-scraping protections.
- State desync: dynamic widgets or date formats reset, leading to wrong actions.
- High-cost failures: long retries causing many LLM calls before giving up.
Core Entities
Models
- GPT-4-Turbo
Metrics
- task success rate
- task completion time
- LLM calls per task
- self-aware vs oblivious failures
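The metrics listed above can all be computed from per-task run logs. The sketch below assumes a hypothetical `RunRecord` shape and uses made-up sample data; it does not reproduce the paper's results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    success: bool
    reported_failure: bool  # True if the agent said it could not finish
    llm_calls: int
    seconds: float


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate success rate, cost, time, and failure self-awareness."""
    if not runs:
        raise ValueError("need at least one run")
    successes = [r for r in runs if r.success]
    failures = [r for r in runs if not r.success]
    # A failure is self-aware if the agent flagged it, oblivious otherwise.
    self_aware = [r for r in failures if r.reported_failure]
    return {
        "success_rate": len(successes) / len(runs),
        "avg_llm_calls": sum(r.llm_calls for r in runs) / len(runs),
        "avg_seconds_success": (
            sum(r.seconds for r in successes) / len(successes) if successes else 0.0
        ),
        "self_aware_share": len(self_aware) / len(failures) if failures else 0.0,
    }


# Made-up sample runs for illustration only.
runs = [
    RunRecord(success=True, reported_failure=False, llm_calls=20, seconds=140.0),
    RunRecord(success=True, reported_failure=False, llm_calls=30, seconds=160.0),
    RunRecord(success=False, reported_failure=True, llm_calls=25, seconds=200.0),
    RunRecord(success=False, reported_failure=False, llm_calls=25, seconds=240.0),
]
stats = summarize(runs)
```

Tracking `self_aware_share` separately matters because a self-aware failure can trigger a fallback such as human review, while an oblivious failure returns a wrong answer with no warning.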
Datasets
- WebVoyager
Benchmarks
- WebVoyager
Context Entities
Models
- Prior agents: Wilbur, WebVoyager multi-modal

