A realistic benchmark and frozen-web environment for testing web research agents

May 6, 2025 · 9 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.35

Citation Count

0

Authors

FutureSearch: Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, Jack Wildman

Why It Matters For Business

If your product uses web-research agents, use Deep Research Bench + RetroSearch to track real-world research skills over time and avoid live-web drift; current agents are useful but not yet human-level on hard research tasks.

Summary TLDR

Deep Research Bench is an 89-instance benchmark for evaluating LLM-powered web research agents on realistic, messy research tasks. The authors provide RetroSearch, a frozen snapshot of scraped web pages, and tools to run ReAct-style agents. The top closed model (o3) reaches a mean score of about 0.51 under low-elicitation prompts; the authors estimate a human-noise ceiling near 0.8. RetroSearch preserves relative model rankings versus live web runs, enabling repeatable, time-stable evaluations.

Problem Statement

There is no stable, repeatable benchmark that measures how well LLM agents can perform real-world web research tasks while controlling for the constantly changing web. The field needs a realistic task suite and a frozen-web environment to compare agents over time.

Main Contribution

Deep Research Bench: 89 multi-step, real-world web research task instances across 8 task types (Find Number, Find Dataset, Find Original Source, Validate Claim, Derive Number, Gather Evidence, Populate Reference Class, Compile Dataset).

RetroSearch: a frozen, queryable database of scraped web pages plus a Serper-like API to run 'retro' (offline) agent evaluations that mimic live search.

A reproducible agent evaluation stack: ReAct agents with Google Search and Query Document tools, standardized scoring rules, failure-mode taxonomy, and automated trace checks.

An initial evaluation of eleven LLMs and several commercial web research products, plus automated analysis of common failure modes (hallucination, repeated tool calls, forgetting).

Public leaderboard and commitment to continuously update tasks and re-run models (drb.futuresearch.ai).
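
To make the frozen-web idea concrete, here is a minimal sketch of a search interface backed by a fixed snapshot, so the same query always returns the same results. Everything in it is an assumption for illustration: `Page`, `FrozenSearch`, the term-overlap ranking, and the result fields are hypothetical and are not RetroSearch's actual interface or ranking method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    title: str
    text: str


class FrozenSearch:
    """Hypothetical stand-in for a frozen-web search API (not the real RetroSearch)."""

    def __init__(self, pages):
        # The snapshot is fixed at construction time and never refreshed.
        self._pages = tuple(pages)

    def search(self, query, k=3):
        terms = set(query.lower().split())
        scored = []
        for page in self._pages:
            words = set((page.title + " " + page.text).lower().split())
            overlap = len(terms & words)
            if overlap:
                scored.append((overlap, page.url, page))
        # Sort by overlap (descending), then URL, so ties break deterministically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [{"title": p.title, "link": p.url} for _, _, p in scored[:k]]
```

Because the snapshot never changes, two runs months apart see identical search results, which is the property that removes live-web drift from the comparison.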

Key Findings

Best ReAct agent mean score observed was 0.51 (o3 agent).

Numbers: Best ReAct score = 0.51 (o3)

Authors estimate a human-noise ceiling for these tasks near 0.8.

Numbers: Estimated noise ceiling ≈ 0.8

RetroSearch reproduces relative model rankings vs live web runs, though per-model scores can shift.

Numbers: Example: o3 Live 0.51 vs Retro 0.46 (Table 5)

Toolless (no-web) agents perform comparably on Validate Claim: Toolless avg 0.61 vs Live ReAct 0.62.

Numbers: Validate Claim: Toolless 0.61 vs Live ReAct 0.62

Common action failure rates per step: hallucination and forgetting vary by model; e.g. DeepSeek-R1 hallucination 0.159, GPT-4 Turbo hallucination 0.019, GPT-4 Turbo forgetting 0.356.

Numbers: Hallucination: DeepSeek-R1 0.159; GPT-4 Turbo 0.019. Forgetting (GPT-4 Turbo) 0.356 (Table 6)

Forgetting information is the strongest single predictor of lower task scores in their regression (-0.843 coefficient).

Numbers: Regression coef forgetting = -0.843, p = 0.014
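
The rank-fidelity finding above (RetroSearch preserves relative model rankings even when per-model scores shift) can be checked on your own runs with a rank correlation. The sketch below uses Spearman's rho as one common choice; the summary does not say which statistic the authors used. Only the o3 pair of scores comes from the text above. `model_b` and `model_c` are invented placeholders, not reported results.

```python
def ranks(scores):
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def spearman(live, retro):
    """Spearman rank correlation for two score dicts over the same models (no ties)."""
    n = len(live)
    live_r, retro_r = ranks(live), ranks(retro)
    d2 = sum((live_r[m] - retro_r[m]) ** 2 for m in live)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Only the o3 pair (0.51 live, 0.46 retro) comes from the summary;
# model_b and model_c are invented placeholders, not reported results.
live = {"o3": 0.51, "model_b": 0.45, "model_c": 0.30}
retro = {"o3": 0.46, "model_b": 0.41, "model_c": 0.33}
rho = spearman(live, retro)
```

A value of rho close to 1 means the two environments order the models the same way, even if every individual score moved.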

Results

Best ReAct agent mean score

Value: 0.51 (o3)

Estimated human noise ceiling

Value: ≈0.8

Live vs Retro rank fidelity

Value: Relative rankings preserved; per-model score shifts (example: o3 Live 0.51 → Retro 0.46)

Validate Claim: Toolless vs Live ReAct

Value: Toolless 0.61 vs Live ReAct 0.62

Per-step hallucination rates (examples)

Value: DeepSeek-R1 0.159, GPT-4 Turbo 0.019, Claude 3.7 0.014

Per-step forgetting rates (example)

Value: GPT-4 Turbo forgetting 0.356 (per step)

What To Try In 7 Days

Run your agent on the RetroSearch snapshot for a few representative tasks to measure regressions over time.

Add a simple memory/state tracker to reduce 'forgetting' and re-run the benchmark subset for impact.

Include a toolless baseline for claim-validation tasks to check whether web access actually adds value.
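
For the memory/state tracker suggested above, one minimal approach is a scratchpad that stores confirmed findings outside the raw trace and re-injects them into every prompt. This is a sketch under stated assumptions: `FindingsTracker` and its methods are hypothetical names, and the summary does not describe any particular memory mechanism used by the authors.

```python
class FindingsTracker:
    """Hypothetical scratchpad that keeps confirmed findings outside the raw trace."""

    def __init__(self):
        self._findings = {}

    def record(self, key, value, source):
        # A later value for the same key overwrites the earlier one.
        self._findings[key] = (value, source)

    def render(self):
        """Return a text block to prepend to every agent prompt."""
        if not self._findings:
            return "Known findings: none yet."
        lines = [f"- {k}: {v} (source: {s})" for k, (v, s) in self._findings.items()]
        return "Known findings:\n" + "\n".join(lines)
```

Calling `render()` at the start of each step keeps earlier results in view even when the trace grows long, which is the situation where forgetting tends to show up.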

Agent Features

Memory

  • RetroSearch: frozen web snapshot for repeatable retrieval
  • Tool-based short-term trace (agent loop history)

Planning

  • iteration budget (50 actions)
  • task-specific tip selection pre-prompting

Tool Use

  • Google Search via Serper API
  • Query Document (page read + excerpt)
  • Playwright + HTTP fetch + ScraperAPI for page access

Frameworks

  • RetroSearch
  • ReAct

Is Agentic

true

Architectures

  • ReAct (explicit thought)
  • ReAct (implicit thought for 'thinking' models)

Optimization Features

Token Efficiency

  • Use of large excerpt size (65,536 chars) to reduce repeated reads

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only 89 instances: limited statistical power for fine-grained comparisons.
  • RetroSearch snapshots require heavy upfront crawling and can miss pages, introducing bias.
  • Agents cannot interact with pages (no clicks/scrolls); only static reading is supported.
  • Some scoring relies on LLMs and subjective human judgments, adding noise.
  • Commercial web products were run once and may suffer from single-run variability.

When Not To Use

  • For tasks requiring dynamic UI interaction (clicking, form submission).
  • When you need large-sample statistical certainty across many topic domains.
  • To evaluate model behavior under high-elicitation prompting (paper focuses on low elicitation).

Failure Modes

  • Forgetting earlier findings (state loss across the trace)
  • Repeated or looping tool calls
  • Hallucinated tool calls or hallucinated facts
  • Satisficing: stopping early without thorough cross-checking
  • Gullibility: trusting low-quality sources
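
Of the failure modes above, repeated tool calls are the easiest to measure mechanically. The sketch below computes a per-step repeat rate from a trace of `(tool, input)` pairs. Exact-match detection and the function name are assumptions made for illustration; the summary does not specify how the authors' automated trace checks identify repeats.

```python
def repeated_call_rate(trace):
    """Fraction of steps whose (tool, input) pair exactly repeats an earlier step."""
    if not trace:
        return 0.0
    seen = set()
    repeats = 0
    for tool, tool_input in trace:
        key = (tool, tool_input)
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(trace)
```

Exact matching undercounts near-duplicate queries (for example, the same search with reordered words), so treat the result as a lower bound.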

Core Entities

Models

  • o3
  • Claude 3.7 Sonnet
  • Claude 3.7 Sonnet Non-thinking
  • Gemini 2.5 Pro
  • Gemini 2.5 Flash
  • GPT-4
  • GPT-4.1
  • GPT-4 Turbo
  • Gemma 3
  • Mistral Small
  • DeepSeek-R1
  • DeepSeek-R1 (driver)
  • ChatGPT o3
  • GPT-4.5 (mentioned in product list context)

Metrics

  • Recall
  • Precision
  • F1
  • Binary 0/1 success
  • Absolute difference in assigned probability
  • Per-step failure rates (hallucination, repeated tool calls, forgetting)
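
For reference, the set-based metrics in this list can be computed as follows. This is a generic sketch using the standard definitions of precision, recall, and F1; how the benchmark maps each task type to a metric, and how it matches predicted items to expected ones, is not described here, so the exact-match set comparison and both function names are assumptions.

```python
def precision_recall_f1(predicted, expected):
    """Standard precision, recall, and F1 over sets of exactly matching items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def probability_error(assigned, reference):
    """Absolute difference between an assigned probability and a reference probability."""
    return abs(assigned - reference)
```

For example, predicting three items of which two appear in a four-item reference set gives precision 2/3, recall 1/2, and F1 4/7.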

Benchmarks

  • Deep Research Bench
  • AgentBench
  • GAIA
  • WebShop
  • WebArena

Context Entities

Models

  • Perplexity Pro
  • Gemini Deep Research
  • OpenAI Deep Research
  • Grok DeepSearch

Datasets

  • Common Crawl (used as fallback)

Benchmarks

  • GAIA
  • AgentBench