WebAgent: combine an HTML-specialist LLM and a code LLM to plan, summarize long pages, and act by generating Python for real websites

July 24, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

16

Authors

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust

Links

Abstract / PDF

Why It Matters For Business

WebAgent shows a practical path to robust web automation: use a small specialist model to understand long HTML and a capable code-generating LLM to act. That reduces brittle failures on real sites: in human-supervised runs, task success on three real websites rises from 10–20% to 65–80%.

Summary TLDR

This paper builds WebAgent, a modular web automation system. HTML-T5, an HTML-specialist long-context model, handles closed-loop planning and extractive summarization; Flan-U-PaLM handles grounded Python program synthesis (Selenium). Trained with self-experience supervision on real sites, WebAgent raises end-to-end success on three real websites from 10–20% (single Flan-U-PaLM baseline) to 65–80%. On the simulated MiniWoB++ benchmark, HTML-T5 alone improves success by 18.7 percentage points over the prior best LLM agent. Key practical ideas: local+global attention for long HTML, long-span denoising pretraining (µ={8,64}), closed-loop sub-instruction planning, and acting via executable code.

Problem Statement

Real web pages have three blockers for LLM agents: (1) open-ended actions that cannot be pre-enumerated, (2) HTML documents far longer than typical LLM context windows, and (3) lack of HTML-specific inductive bias in general LLMs. These limit generalization from simulators to real websites and make end-to-end automation brittle.

Main Contribution

WebAgent system: a modular two-stage pipeline in which HTML-T5 plans and summarizes, and Flan-U-PaLM synthesizes Python programs.

HTML-T5: a new encoder-decoder model with local+global attention and long-span denoising pretraining on CommonCrawl HTML.

Self-experience supervision: semi-automatic data collection using scripted planners + prompted code generation to fine-tune HTML-T5 on real sites.

Empirical gains: large improvements on real website tasks, MiniWoB++, and Mind2Web benchmarks.
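The two-stage pipeline described above can be sketched as a closed loop: plan the next sub-instruction, extract a relevant snippet, generate a program, execute it, and repeat on the updated page. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `run_episode`, `plan_and_summarize`, `generate_code`, and `execute` are hypothetical placeholders for the roles the summary assigns to HTML-T5, Flan-U-PaLM, and the browser.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces. In the paper these roles are filled by HTML-T5
# (planner + summarizer) and Flan-U-PaLM (code generator); here they are
# plain callables so the control flow can be read and tested in isolation.
Planner = Callable[[str, List[str], str], Tuple[Optional[str], str]]
CodeGen = Callable[[str, str], str]
Executor = Callable[[str], str]


@dataclass
class Episode:
    instruction: str
    history: List[str] = field(default_factory=list)   # past sub-instructions
    programs: List[str] = field(default_factory=list)  # generated programs


def run_episode(instruction: str, html: str, plan_and_summarize: Planner,
                generate_code: CodeGen, execute: Executor,
                max_steps: int = 10) -> Episode:
    """Closed-loop control: re-plan after every executed program."""
    episode = Episode(instruction)
    for _ in range(max_steps):
        # Stage 1: predict the next sub-instruction and a short HTML snippet.
        sub_instruction, snippet = plan_and_summarize(
            instruction, episode.history, html)
        if sub_instruction is None:  # the planner signals completion
            break
        # Stage 2: turn the sub-instruction plus snippet into a program.
        program = generate_code(sub_instruction, snippet)
        # Act: run the program and observe the new page.
        html = execute(program)
        episode.history.append(sub_instruction)
        episode.programs.append(program)
    return episode
```

Because the planner sees the history and the freshly observed HTML at every step, an early mistake can in principle be corrected later, which is the point of planning in a closed loop rather than committing to a full plan up front.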

Key Findings

Modular WebAgent dramatically improves real-site success rates.

Numbers: success rate on real-estate 65% vs 10%; social-media 70% vs 20%; map 80% vs 10%

HTML-T5-XL outperforms prior best on MiniWoB++ by +18.7 percentage points.

Numbers: HTML-T5-XL 67.1% vs WebN-T5-XL 48.4% (12K demonstrations)

HTML-T5 improves planning/action metrics on Mind2Web generalization.

Numbers: element accuracy +5.5pp, operation F1 +6pp, step success rate +5.8pp (vs MindAct baseline)

Local+global attention and long-span denoising materially help with long HTML.

Numbers: local+global attention gives >18% relative gain vs instruction-finetuned dense attention on MiniWoB++

Failure breakdown shows planning remains the hardest error source.

Numbers: in real-estate failures, 70% were incorrect plans (planning error dominant)
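The findings above mix two kinds of comparison: absolute differences in percentage points (pp) and relative gains. A short sketch of the arithmetic, using only figures quoted in this summary, shows how far apart the two can be for the same pair of numbers. The helper names are illustrative.

```python
def percentage_point_gain(new: float, base: float) -> float:
    """Absolute difference between two success rates, in percentage points."""
    return round(new - base, 1)


def relative_gain(new: float, base: float) -> float:
    """Improvement expressed as a percentage of the baseline."""
    return round(100.0 * (new - base) / base, 1)


# MiniWoB++ with 12K demonstrations: HTML-T5-XL 67.1% vs WebN-T5-XL 48.4%.
print(percentage_point_gain(67.1, 48.4))  # 18.7
print(relative_gain(67.1, 48.4))          # 38.6

# Real-estate site: WebAgent 65% vs single Flan-U-PaLM 10%.
print(percentage_point_gain(65, 10))      # 55
```

So the "+18.7" MiniWoB++ figure is a percentage-point difference; expressed relative to the 48.4% baseline, the same gap is about 38.6%.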

Results

real-world success rate (real-estate)

Value: WebAgent 65%, Flan-U-PaLM 10%

Baseline: single Flan-U-PaLM

real-world success rate (social-media)

Value: WebAgent 70%, Flan-U-PaLM 20%

Baseline: single Flan-U-PaLM

real-world success rate (map)

Value: WebAgent 80%, Flan-U-PaLM 10%

Baseline: single Flan-U-PaLM

MiniWoB++ average success

Value: HTML-T5-XL 67.1% (12K demos)

Baseline: WebN-T5-XL 48.4% (12K demos)

MiniWoB++ large-scale finetune

Value: HTML-T5-XL 85.6% (347K demos)

Baseline: Flan-T5-XXL 79.0%

Mind2Web element accuracy

Value: HTML-T5-XL 60.6%

Baseline: MindAct (Flan-T5-XL) 55.1%

Who Should Care

What To Try In 7 Days

Prototype a two-module pipeline: fine-tune an encoder-decoder on your site HTML to predict sub-steps and extract snippets, and use a code-capable LLM to emit Selenium Python.

Collect a few hundred self-experience episodes via scripted plans and automated program execution; use them to fine-tune the HTML model.

Pretrain or adapt a long-context encoder with local+global attention and long-span masking (µ ∈ {8, 64}) if your pages exceed ~2–4K tokens.
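The self-experience collection step in the list above could be sketched as follows: run the scripted-plan agent on many instructions, discard episodes that raised an error or were not judged successful, and turn each remaining step into one fine-tuning example for the HTML model. This is a hedged sketch. The function name, the result dictionary keys (`success`, `error`, `steps`), and the per-step fields are all assumptions made for illustration, not a format taken from the paper.

```python
from typing import Callable, Dict, List


def collect_self_experience(instructions: List[str],
                            run_agent: Callable[[str], Dict],
                            max_episodes: int = 300) -> List[Dict]:
    """Build fine-tuning examples from successful, error-free episodes only."""
    dataset = []
    for instruction in instructions[:max_episodes]:
        result = run_agent(instruction)  # hypothetical: one full episode
        # Filter: drop episodes whose programs crashed or whose outcome failed.
        if result.get("error") is not None or not result.get("success"):
            continue
        for step in result["steps"]:
            dataset.append({
                # What the HTML model sees at this step.
                "input": {
                    "instruction": instruction,
                    "history": step["history"],
                    "html": step["html"],
                },
                # What it should predict: next sub-step and snippet references.
                "target": {
                    "sub_instruction": step["sub_instruction"],
                    "snippet_refs": step["snippet_refs"],
                },
            })
    return dataset
```

Filtering on execution errors and on the final outcome is what keeps human labeling light: the environment itself rejects most bad demonstrations before anyone looks at them.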

Agent Features

Memory

  • Long-context handling via local+global attention for up to 4096 tokens

Planning

  • Closed-loop sub-instruction planning (iterative decomposition)
  • Planning conditions on HTML summaries and history

Tool Use

  • Generates executable Python (Selenium) as open-ended action space
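To illustrate what "acting via executable code" means in practice, the sketch below runs a generated program string against a driver object and reports any exception as feedback. To keep the example self-contained it uses a hypothetical `FakeDriver` that only mimics the shape of Selenium's `find_element(by, value)`, `send_keys`, and `click` calls; the program text and element ids are invented for illustration.

```python
class By:
    ID = "id"  # stand-in for selenium.webdriver.common.by.By


class FakeElement:
    """Hypothetical element that records actions instead of touching a page."""

    def __init__(self, log, locator):
        self._log = log
        self._locator = locator

    def send_keys(self, text):
        self._log.append(("send_keys", self._locator, text))

    def click(self):
        self._log.append(("click", self._locator))


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver, for illustration only."""

    def __init__(self):
        self.log = []

    def find_element(self, by, value):
        return FakeElement(self.log, (by, value))


# Example of the kind of program a code LLM might emit for one sub-instruction.
GENERATED_PROGRAM = '''
box = driver.find_element(By.ID, "search-box")
box.send_keys("2 bedroom apartment")
driver.find_element(By.ID, "search-button").click()
'''


def run_generated_program(program, driver):
    """Execute generated code; return None on success or an error string."""
    namespace = {"driver": driver, "By": By}
    try:
        exec(program, namespace)
    except Exception as exc:  # surface the failure as feedback to the agent
        return f"{type(exc).__name__}: {exc}"
    return None
```

Because the action space is arbitrary code, no fixed list of actions has to be enumerated per site. The same property is why generated programs should only ever run in an isolated environment, given the prompt-injection risk this summary notes.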

Frameworks

  • Encoder-decoder HTML-T5 with local+global attention
  • Flan-U-PaLM for conditional program synthesis

Is Agentic

true

Architectures

  • Modular two-stage LLMs: HTML-T5 (encoder-decoder) + Flan-U-PaLM (codegen)

Collaboration

  • Specialist-generalist cooperation: HTML-T5 summarizes and plans, Flan-U-PaLM synthesizes programs

Optimization Features

Token Efficiency

  • Snippet extraction lowers tokens fed to code generator (avoid full-page input)

Infra Optimization

  • Pretraining used 128 TPU-v3 cores and a 4096-token input length

Model Optimization

  • Local+global attention in encoder to capture HTML hierarchy
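A rough sketch of why local+global attention scales to long HTML: each token attends to a window of nearby tokens plus one summary per fixed-size block, instead of to every other token. The layout below assumes a LongT5-style scheme; the radius and block size are illustrative values, not settings confirmed by this summary.

```python
def local_global_keys(seq_len, local_radius, block_size):
    """For each query position, return (local key positions, global block ids)."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    layout = []
    for i in range(seq_len):
        lo = max(0, i - local_radius)
        hi = min(seq_len, i + local_radius + 1)
        # Local keys: neighbours within the radius. Global keys: one per block.
        layout.append((range(lo, hi), range(num_blocks)))
    return layout


def attention_cost(seq_len, local_radius, block_size):
    """Count query-key pairs for the sparse layout versus dense attention."""
    layout = local_global_keys(seq_len, local_radius, block_size)
    sparse = sum(len(local) + len(glob) for local, glob in layout)
    return sparse, seq_len * seq_len


# Illustrative settings for a 4096-token page (assumed, not from the paper).
sparse, dense = attention_cost(seq_len=4096, local_radius=127, block_size=16)
print(f"sparse pairs: {sparse}, dense pairs: {dense}")
```

The local window matches how HTML nests: an element's siblings and children are usually nearby in the token stream, while the block summaries carry page-level context.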

System Optimization

  • Self-experience supervision collects demonstrations with minimal human labeling

Training Optimization

  • Pretrain with mixture of long-span denoising (µ={8,64}) on 3.41M HTML examples
  • Initialize from PEGASUS-style long-text pretraining for stability
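Long-span denoising can be illustrated with a simplified T5-style span corruption routine: a fraction of tokens is replaced by sentinel markers, and the target lists each sentinel followed by the tokens it hid. The sketch uses fixed-length spans equal to the mean span length µ and a 15% noise density; both are simplifying assumptions, and the function name is hypothetical.

```python
import random


def corrupt_spans(tokens, mean_span, noise_density=0.15, seed=0):
    """Mask non-overlapping spans of length ~mean_span; return (inputs, targets)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    rng = random.Random(seed)
    n = len(tokens)
    num_noise = max(1, round(n * noise_density))        # tokens to mask in total
    num_spans = max(1, round(num_noise / mean_span))    # how many spans
    span_len = max(1, num_noise // num_spans)           # fixed span length
    segment = n // num_spans                            # one span per segment
    inputs, targets, cursor = [], [], 0
    for k in range(num_spans):
        seg_start = k * segment
        # Pick a start so the span stays inside its own segment.
        start = rng.randint(seg_start, seg_start + segment - span_len)
        end = start + span_len
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)           # the span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])  # the model must reconstruct these
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


tokens = [f"t{i}" for i in range(1280)]
short_in, short_tgt = corrupt_spans(tokens, mean_span=8)
long_in, long_tgt = corrupt_spans(tokens, mean_span=64)
```

With µ=64 the same masking budget is spent on a few long spans rather than many short ones, so each target tends to cover a whole element or subtree instead of a fragment of a tag.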

Inference Optimization

  • Extractive HTML summarization to reduce prompt length and focus agent on relevant snippets
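Extractive summarization here means selecting existing pieces of the page rather than rewriting it. A minimal sketch, assuming the summarizer outputs identifiers of relevant elements and that each element carries a `data-ref` attribute (an assumption for this example): look up the referenced elements and pass only their markup to the code generator. The sketch uses the standard-library XML parser, so it only handles well-formed markup; real pages would need a lenient HTML parser.

```python
import xml.etree.ElementTree as ET


def extract_snippets(html, predicted_refs, attr="data-ref"):
    """Return the serialized subtrees whose `attr` value was predicted."""
    root = ET.fromstring(html)
    wanted = set(predicted_refs)
    snippets = []
    for element in root.iter():
        if element.get(attr) in wanted:
            tail, element.tail = element.tail, None  # do not serialize trailing text
            snippets.append(ET.tostring(element, encoding="unicode"))
            element.tail = tail
    return snippets


page = (
    '<html><body>'
    '<div data-ref="1"><a href="/x">Home</a></div>'
    '<form data-ref="2"><input data-ref="3" name="q"/>'
    '<button data-ref="4">Search</button></form>'
    '</body></html>'
)
snippets = extract_snippets(page, ["3", "4"])
prompt_html = "\n".join(snippets)  # far shorter than the full page
```

Only the two referenced elements reach the code generator, which is how snippet extraction keeps the prompt short and focused on task-relevant markup.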

Reproducibility

Data Urls

  • CommonCrawl (HTML corpus, April 2019)
  • MiniWoB++ (public benchmark)
  • Mind2Web (public dataset)
  • WebSRC (public benchmark)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a large code-generation LLM (Flan-U-PaLM 540B used) which is expensive to run.
  • Evaluation is human-supervised on a small set of anonymous sites; automated large-scale evaluation remains open.
  • Planning over long horizons still causes most failures; closed-loop planner errors accumulate.
  • Security risks: prompt injection and possible misuse of web automation are noted but not fully solved.

When Not To Use

  • When you cannot run large code-capable LLMs or pay their inference cost.
  • For highly sensitive or private web automation where automated code execution risks data leaks.
  • When sub-second latency and minimal compute are required on-device.

Failure Modes

  • Planning errors: wrong or inconsistent sub-instruction decomposition over long horizons.
  • Programming errors: generated code fails to match sub-instruction or HTML semantics.
  • Summarization errors: retrieved snippets miss task-relevant elements or include noisy links.
  • Security failures: prompt injection or misuse leading to harmful web actions.

Core Entities

Models

  • HTML-T5
  • Flan-U-PaLM (540B)
  • LongT5
  • Flan-LongT5
  • Flan-T5
  • WebN-T5

Metrics

  • success rate
  • score (percent covered attributes)
  • Accuracy
  • operation F1
  • step success rate
  • EM / F1 (WebSRC)

Datasets

  • CommonCrawl (HTML corpus, April 2019)
  • MiniWoB++
  • Mind2Web
  • WebSRC
  • Description Generation (Gur et al., 2022)

Benchmarks

  • MiniWoB++
  • Mind2Web
  • WebSRC