Combine MCTS + AI self-critique + offline DPO to train web agents that learn from search traces

August 13, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

5

Authors

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov

Why It Matters For Business

Train web agents on their own search traces to get large gains in task success from a single day of collected data, without risky online RL. Pairing the trained policy with online search reached 95.4% success on a real OpenTable booking task.

Summary TLDR

Agent Q mixes Monte‑Carlo Tree Search (MCTS), an AI critic that scores step choices, and offline Direct Preference Optimization (DPO) to turn search traces into training data. On simulated WebShop and a real OpenTable booking task, this pipeline raises zero-shot success rates substantially (on OpenTable, LLaMA‑3‑70B goes from 18.6% to 81.7% after one day of data collection) and reaches 95.4% when paired with online MCTS. The approach assigns credit at the step level through ranked action pairs and trains on both successes and failures, which reduces greedy behavior and improves exploration.

Problem Statement

Current LLMs can reason in text but fail to reliably plan and act across many steps in interactive web environments. Static fine-tuning and outcome-only supervision give limited exploration and poor credit assignment. Agent Q asks: can we combine search, step-level self-feedback, and offline preference optimization so agents learn from search traces and improve multi-step web tasks?

Main Contribution

Agent Q pipeline: MCTS over web pages, AI process supervision (critic) for step scores, and node-level DPO training on collected traces.

Off-policy DPO variant that uses logged likelihoods to avoid running a separate reference model.

Empirical scaling to a real-world booking site (OpenTable) with large gains: LLaMA‑3‑70B zero-shot success rises from 18.6% to 81.7% after one day of data collection, and to 95.4% with online MCTS.

Key Findings

Agent Q plus test-time MCTS reaches human-level or better performance on WebShop.

Numbers: WebShop success: base 28.6% → Agent Q+MCTS 50.5% (human 50.0%)

On a live booking site, training from search traces yields large zero-shot gains.

Numbers: OpenTable: LLaMA‑3‑70B zero-shot 18.6% → Agent Q zero-shot 81.7%

Combining trained policy with online MCTS gives the best final success.

Numbers: OpenTable: Agent Q zero-shot 81.7% → Agent Q + MCTS 95.4%

Outcome-only RL fine-tuning helps but is weaker than step-level supervision from search.

Numbers: WebShop: outcome DPO 40.6% vs Agent Q+MCTS 50.5%; OpenTable: DPO 71.8% vs Agent Q 81.7%

Results

  • WebShop success rate (base xLAM-v0.1-r): 28.6%
  • WebShop success rate (RFT): 31.3% (baseline 28.6%)
  • WebShop success rate (trajectory-level DPO): 40.6% (baseline 31.3%)
  • WebShop success rate (base + MCTS at test time): 48.4% (baseline 28.6%)
  • WebShop success rate (Agent Q + MCTS): 50.5% (baseline 48.4%)
  • OpenTable success rate (LLaMA‑3‑70B zero-shot): 18.6%
  • OpenTable success rate (RFT on 600 trajectories): 67.2% (baseline 18.6%)
  • OpenTable success rate (outcome DPO): 71.8% (baseline 67.2%)
  • OpenTable success rate (Agent Q zero-shot): 81.7% (baseline 71.8%)
  • OpenTable success rate (Agent Q + MCTS): 95.4% (baseline 81.7%)

Who Should Care

What To Try In 7 Days

Collect a day of agent rollouts with an LLM policy and record DOM traces and outcomes.

Run MCTS over candidate step actions and have the LLM rank proposals to create step-level preferences.

Fine-tune the policy offline with DPO on ranked pairs, then evaluate with a shallow MCTS at inference.
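The second step above, turning ranked proposals at a search node into step-level preferences, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the blending weight `alpha`, and the `min_gap` threshold are all assumptions made for the example. The idea is to blend the search value estimate with the critic's score for each candidate action, then keep only pairs whose scores differ clearly.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def blended_score(q_mcts: float, q_critic: float, alpha: float = 0.5) -> float:
    """Mix the search value estimate with the critic's score (alpha is an assumed knob)."""
    return alpha * q_mcts + (1.0 - alpha) * q_critic


def node_preference_pairs(
    scores: Dict[str, float], min_gap: float = 0.2
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) action pairs for one search node.

    A pair is kept only if the score gap exceeds min_gap (an assumed
    threshold), so near-ties do not become noisy preferences.
    """
    pairs = []
    for a, b in combinations(scores, 2):
        gap = scores[a] - scores[b]
        if abs(gap) > min_gap:
            pairs.append((a, b) if gap > 0 else (b, a))
    return pairs


# Hypothetical scores for three candidate actions at one node.
example_scores = {
    "CLICK [date picker]": blended_score(0.9, 0.8),
    "SCROLL down": blended_score(0.2, 0.4),
    "GOTO homepage": blended_score(0.3, 0.3),
}
example_pairs = node_preference_pairs(example_scores)
```

Each resulting pair, together with the node's history as the prompt, would form one training example for the offline DPO step.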

Agent Features

Memory

  • Compact history h_t = (past actions, current DOM)
  • Avoid full DOM trajectories to limit context size
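One way to picture the compact history is a small container that accumulates actions but overwrites the page observation at every step. The class and prompt layout below are hypothetical; the source only states that h_t holds past actions and the current DOM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompactHistory:
    """Agent state h_t: all past actions plus only the latest DOM snapshot."""

    actions: List[str] = field(default_factory=list)
    current_dom: str = ""

    def step(self, action: str, new_dom: str) -> None:
        # Record the action, then overwrite (not append) the DOM observation.
        self.actions.append(action)
        self.current_dom = new_dom

    def to_prompt(self, task: str) -> str:
        # Hypothetical prompt layout; the source does not specify one.
        past = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(self.actions)) or "(none)"
        return f"Task: {task}\nPast actions:\n{past}\nCurrent page:\n{self.current_dom}"
```

Because earlier DOM snapshots are discarded, the prompt grows only by one short action line per step instead of one full page per step.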

Planning

  • Monte-Carlo Tree Search (MCTS, UCB1)
  • PlanReAct-style planning with inner thoughts
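For readers unfamiliar with UCB1, the selection and backup steps of a generic MCTS can be sketched as below. This is a textbook-style illustration rather than the paper's implementation: the `Node` fields, the exploration constant `c_exp`, and the running-mean backup are assumptions chosen for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    visits: int = 0
    value: float = 0.0  # running mean of backed-up returns
    children: Dict[str, "Node"] = field(default_factory=dict)


def ucb1_select(node: Node, c_exp: float = 1.4) -> str:
    """Return the action key of the child with the highest UCB1 score."""

    def score(child: Node) -> float:
        if child.visits == 0:
            return math.inf  # try unvisited actions first
        bonus = c_exp * math.sqrt(math.log(node.visits) / child.visits)
        return child.value + bonus

    return max(node.children, key=lambda a: score(node.children[a]))


def backup(path: List[Node], reward: float) -> None:
    """Update visit counts and running-mean values along a root-to-leaf path."""
    for n in path:
        n.visits += 1
        n.value += (reward - n.value) / n.visits
```

The exploration bonus shrinks as a child is visited more often, so rarely tried actions keep getting revisited even when their current value estimate is lower.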

Tool Use

  • Web actions: CLICK, TYPE, GOTO, SUBMIT, SCROLL
  • Use of online search at inference
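The action vocabulary above maps naturally onto a small typed structure. The sketch below assumes a plain-text action format of `VERB target text`, which is a hypothetical convention for illustration; the source lists only the five verbs and does not describe how actions are serialized.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    CLICK = "CLICK"
    TYPE = "TYPE"
    GOTO = "GOTO"
    SUBMIT = "SUBMIT"
    SCROLL = "SCROLL"


@dataclass(frozen=True)
class WebAction:
    kind: ActionType
    target: Optional[str] = None  # element id, URL, or scroll direction
    text: Optional[str] = None    # only used by TYPE


def parse_action(raw: str) -> WebAction:
    """Parse 'VERB target text...' (a hypothetical format) into a WebAction."""
    parts = raw.strip().split(maxsplit=2)
    if not parts:
        raise ValueError("empty action")
    kind = ActionType(parts[0].upper())  # raises ValueError on unknown verbs
    target = parts[1] if len(parts) > 1 else None
    text = parts[2] if len(parts) > 2 else None
    if kind is ActionType.TYPE and (target is None or text is None):
        raise ValueError("TYPE needs a target and text")
    return WebAction(kind, target, text)
```

Rejecting malformed or unknown actions before execution keeps invalid model outputs from reaching the browser.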

Frameworks

  • Direct Preference Optimization (DPO, offline)
  • Reinforced Fine-Tuning (RFT)
  • Replay buffer / off-policy DPO variant

Is Agentic

true

Architectures

  • autoregressive LLM (LLaMA family, Mixtral)
  • critic/evaluator LLM (GPT-4V used for reward labeling)

Optimization Features

Token Efficiency

  • Compact history representation to save context tokens

Training Optimization

  • Off-policy DPO using logged likelihoods to avoid a separate reference model
  • Construct node-level preference pairs from MCTS traces
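The standard DPO loss compares policy and reference log-likelihoods for a chosen and a rejected response. A minimal sketch of the off-policy idea, assuming the reference terms are simply the log-likelihoods stored when each action was sampled, is shown below. The function signature and the default `beta` are illustrative choices, not values taken from the paper.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    logged_ref_chosen: float,
    logged_ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) action pair.

    logp_* come from the policy being trained; logged_ref_* are the
    log-likelihoods recorded at sampling time, used here in place of
    a forward pass through a separate reference model (an assumption
    about how the logged values are consumed).
    """
    margin = beta * (
        (logp_chosen - logged_ref_chosen) - (logp_rejected - logged_ref_rejected)
    )
    # -log(sigmoid(margin)) in a numerically stable form (softplus of -margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy still matches the logged likelihoods the margin is zero and the loss equals log 2; it falls as the policy raises the chosen action's likelihood relative to the rejected one.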

Inference Optimization

  • LoRA
  • Sample K proposals per node from base LLM

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on large multimodal models (GPT-4V) for reliable evaluation; open-source alternatives were not used.
  • MCTS on live sites risks irreversible harmful actions and needs rollback or human oversight.
  • Critic model remained frozen; joint critic fine-tuning not explored.
  • Performance was measured on a restricted booking task and a simulated shop, not across many web domains.

When Not To Use

  • On safety-critical sites where actions can cause irreversible harm (banking, payments).
  • When you cannot collect or store detailed DOM/action traces for offline training.
  • If only small models are available and no multimodal evaluator exists.

Failure Modes

  • Greedy behavior: model may prefer first-page matches and avoid exploration without search.
  • Sparse reward credit assignment: long-horizon tasks can still miscredit early steps.
  • Risky online exploration: MCTS rollouts can perform unsafe actions on live sites.

Core Entities

Models

  • LLaMA-3-70B-Instruct
  • xLAM-v0.1-r
  • Mixtral-8x7B-Instruct (base for xLAM)
  • GPT-4V (evaluator/critic)

Metrics

  • Success rate (binary task success)
  • Step count / average trajectory length

Datasets

  • WebShop (simulated e-commerce tasks)
  • OpenTable benchmark (programmatically generated booking queries)

Benchmarks

  • WebShop (Yao et al., 2022)
  • Live OpenTable booking environment (this work)