Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.35
Citation Count
0
Why It Matters For Business
If your product relies on web-research agents, Deep Research Bench and RetroSearch let you track real-world research skill over time without live-web drift. Current agents are useful but not yet human-level on hard research tasks.
Summary TLDR
Deep Research Bench is an 89-instance benchmark for evaluating LLM-powered web research agents on realistic, messy research tasks. The authors provide RetroSearch, a frozen snapshot of scraped web pages, and tools to run ReAct-style agents. Top closed models (e.g., o3) reach a mean score of about 0.51 under low-elicitation prompts, against an estimated human noise ceiling near 0.8. RetroSearch preserves relative model rankings versus live web runs, enabling repeatable, time-stable evaluations.
Problem Statement
There is no stable, repeatable benchmark that measures how well LLM agents can perform real-world web research tasks while controlling for the constantly changing web. The field needs a realistic task suite and a frozen-web environment to compare agents over time.
Main Contribution
Deep Research Bench: 89 multi-step, real-world web research task instances across 8 task types (Find Number, Find Dataset, Find Original Source, Validate Claim, Derive Number, Gather Evidence, Populate Reference Class, Compile Dataset).
RetroSearch: a frozen, queryable database of scraped web pages plus a Serper-like API to run 'retro' (offline) agent evaluations that mimic live search.
A reproducible agent evaluation stack: ReAct agents with Google Search and Query Document tools, standardized scoring rules, failure-mode taxonomy, and automated trace checks.
An initial evaluation of eleven LLMs and several commercial web research products, plus automated analysis of common failure modes (hallucination, repeated tool calls, forgetting).
Public leaderboard and commitment to continuously update tasks and re-run models (drb.futuresearch.ai).
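The RetroSearch idea, serving search results from a frozen snapshot instead of the live web, can be illustrated with a minimal sketch. Everything here is hypothetical: the `FrozenPage` type, the `RetroIndex` class, its keyword-overlap ranking, and the result fields are illustrative assumptions, not the authors' schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenPage:
    url: str
    title: str
    text: str  # page content as scraped at snapshot time


class RetroIndex:
    """Hypothetical in-memory stand-in for a frozen, queryable page store."""

    def __init__(self, pages):
        self._pages = list(pages)

    def search(self, query, k=5):
        """Return up to k results ranked by naive keyword overlap."""
        terms = set(query.lower().split())
        scored = []
        for page in self._pages:
            words = set((page.title + " " + page.text).lower().split())
            overlap = len(terms & words)
            if overlap:
                scored.append((overlap, page))
        # Sort by overlap, then URL, so the same query always yields the same order.
        scored.sort(key=lambda item: (-item[0], item[1].url))
        return [
            {"link": p.url, "title": p.title, "snippet": p.text[:80]}
            for _, p in scored[:k]
        ]


pages = [
    FrozenPage("https://example.org/a", "Solar capacity 2023", "global solar capacity grew in 2023"),
    FrozenPage("https://example.org/b", "Wind report", "wind capacity figures for europe"),
]
index = RetroIndex(pages)
results = index.search("solar capacity")
```

Because the page store never changes, repeating a query returns the same results, which is the property that makes evaluations comparable across time.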
Key Findings
Best ReAct agent mean score observed was 0.51 (o3 agent).
Authors estimate a human-noise ceiling for these tasks near 0.8.
RetroSearch reproduces relative model rankings vs live web runs, though per-model scores can shift.
On Validate Claim, toolless (no-web) agents perform about as well as agents with live web access: toolless average 0.61 vs. live ReAct 0.62.
Per-step failure rates vary widely by model: DeepSeek-R1 hallucinates at a rate of 0.159 versus 0.019 for GPT-4 Turbo, while GPT-4 Turbo forgets earlier information at a rate of 0.356.
Forgetting information is the strongest single predictor of lower task scores in the authors' regression (coefficient -0.843).
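One way to quantify whether a frozen-web run "reproduces relative model rankings" is a rank correlation between per-model scores from the two settings. The sketch below uses Spearman's rho as an assumed choice of statistic; the source does not say which measure the authors used. The model names and scores are placeholders, not figures from the benchmark.

```python
def ranks(scores):
    """Map each model to its rank (1 = best); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def spearman(live, retro):
    """Spearman rank correlation for two score dicts over the same models (no ties)."""
    if set(live) != set(retro):
        raise ValueError("both runs must cover the same models")
    n = len(live)
    if n < 2:
        raise ValueError("need at least two models")
    live_ranks, retro_ranks = ranks(live), ranks(retro)
    d_squared = sum((live_ranks[m] - retro_ranks[m]) ** 2 for m in live)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder scores for hypothetical models, not results from the paper.
live = {"model_a": 0.50, "model_b": 0.42, "model_c": 0.30}
retro = {"model_a": 0.46, "model_b": 0.40, "model_c": 0.33}
rho = spearman(live, retro)
```

Here every per-model score shifts between the two runs, yet the ordering is unchanged, so `rho` is 1.0. That mirrors the distinction the finding draws between absolute scores and relative rankings.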
Results
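Best ReAct agent mean score: 0.51 (o3)
Estimated human noise ceiling: near 0.8
Live vs Retro rank fidelity: relative model rankings preserved; per-model scores can shift
Validate Claim, Toolless vs Live ReAct: 0.61 vs 0.62
Per-step hallucination rates (examples): DeepSeek-R1 0.159, GPT-4 Turbo 0.019
Per-step forgetting rates (example): GPT-4 Turbo 0.356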
What To Try In 7 Days
Run your agent on the RetroSearch snapshot for a few representative tasks to measure regressions over time.
Add a simple memory/state tracker to reduce 'forgetting' and re-run the benchmark subset for impact.
Include a toolless baseline for claim-validation tasks to check whether web access actually adds value.
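The "simple memory/state tracker" suggested above could be as small as a notes store whose contents are re-injected into every prompt. This is a minimal sketch under that assumption; `FindingsTracker` and its methods are hypothetical names, not part of the benchmark tooling.

```python
class FindingsTracker:
    """Hypothetical scratchpad that keeps confirmed findings outside the model's context."""

    def __init__(self):
        self._findings = {}  # key -> (value, source_url)

    def record(self, key, value, source_url):
        """Store or overwrite a finding together with where it came from."""
        self._findings[key] = (value, source_url)

    def render(self):
        """Format findings as a block to prepend to the next agent prompt."""
        if not self._findings:
            return "Known findings: none yet."
        lines = ["Known findings:"]
        for key, (value, source_url) in self._findings.items():
            lines.append(f"- {key}: {value} (source: {source_url})")
        return "\n".join(lines)


tracker = FindingsTracker()
tracker.record("report_year", "2023", "https://example.org/report")
prompt_prefix = tracker.render()
```

Prepending `prompt_prefix` at each step keeps earlier findings visible even when the raw trace grows long, which targets the forgetting failure mode directly.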
Agent Features
Memory
- RetroSearch: frozen web snapshot for repeatable retrieval
- Tool-based short-term trace (agent loop history)
Planning
- Iteration budget (50 actions)
- Pre-prompting with selected task-specific tips
Tool Use
- Google Search via Serper API
- Query Document (page read + excerpt)
- Playwright + HTTP fetch + ScraperAPI for page access
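The agent loop implied by these features (ReAct-style steps, a search tool, a document-query tool, and a 50-action budget) can be sketched as follows. The `policy` callable is a hypothetical stand-in for the LLM call, and both tools are stubs; none of the names come from the authors' code.

```python
MAX_ACTIONS = 50  # iteration budget listed above


def run_agent(task, policy, tools, max_actions=MAX_ACTIONS):
    """Run a ReAct-style loop: the policy picks an action, a tool returns an observation.

    `policy(task, trace)` is a hypothetical stand-in for the LLM call. It returns
    ("final", answer) to stop, or (tool_name, argument) to call a tool.
    """
    trace = []  # short-term memory: list of (action, argument, observation)
    for _ in range(max_actions):
        action, argument = policy(task, trace)
        if action == "final":
            return argument, trace
        if action not in tools:
            observation = f"error: unknown tool {action!r}"
        else:
            observation = tools[action](argument)
        trace.append((action, argument, observation))
    return None, trace  # budget exhausted without a final answer


def scripted_policy(task, trace):
    """Toy policy: search once, read once, then answer with the last observation."""
    if len(trace) == 0:
        return "search", task
    if len(trace) == 1:
        return "query_document", trace[0][2]
    return "final", trace[-1][2]


# Stub tools standing in for a search API and a page reader.
tools = {
    "search": lambda query: "https://example.org/page",
    "query_document": lambda url: f"excerpt from {url}",
}
answer, trace = run_agent("find the figure", scripted_policy, tools)
```

Returning the trace alongside the answer is a deliberate choice in this sketch: it is what makes after-the-fact checks for hallucinated or repeated tool calls possible.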
Frameworks
- RetroSearch
- ReAct
Is Agentic
true
Architectures
- ReAct (explicit thought)
- ReAct (implicit thought for 'thinking' models)
Optimization Features
Token Efficiency
- Use of large excerpt size (65,536 chars) to reduce repeated reads
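A capped excerpt of this kind might be produced as in the sketch below. Only the 65,536-character limit comes from the source; centering the window on the first keyword hit is an assumption made for illustration, and `excerpt` is a hypothetical helper.

```python
EXCERPT_CHARS = 65_536  # excerpt size listed above


def excerpt(text, keyword, limit=EXCERPT_CHARS):
    """Return at most `limit` characters of `text`, centered on the first keyword hit.

    Centering on a keyword is an assumption for illustration; falls back to the
    start of the page when the keyword is absent.
    """
    if len(text) <= limit:
        return text
    hit = text.lower().find(keyword.lower())
    if hit == -1:
        return text[:limit]
    # Clamp the window so it stays inside the page at both ends.
    start = max(0, min(hit - limit // 2, len(text) - limit))
    return text[start:start + limit]


page = "a" * 100 + "needle" + "b" * 100
short = excerpt(page, "needle", limit=20)
```

With a large limit, most pages fit in a single excerpt, so the agent rarely needs to re-read the same document.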
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only 89 instances: limited statistical power for fine-grained comparisons.
- RetroSearch snapshots require heavy upfront crawling and can miss pages, introducing bias.
- Agents cannot interact with pages (no clicks/scrolls); only static reading is supported.
- Some scoring relies on LLMs and subjective human judgments, adding noise.
- Commercial web products were run only once, so their scores are subject to single-run variability.
When Not To Use
- For tasks requiring dynamic UI interaction (clicking, form submission).
- When you need large-sample statistical certainty across many topic domains.
- To evaluate model behavior under high-elicitation prompting (paper focuses on low elicitation).
Failure Modes
- Forgetting earlier findings (state loss across the trace)
- Repeated or looping tool calls
- Hallucinated tool calls or hallucinated facts
- Satisficing: stopping early without thorough cross-checking
- Gullibility: trusting low-quality sources
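Of these failure modes, repeated tool calls are the easiest to check mechanically from a trace. The sketch below uses an exact-match heuristic, which is an assumption: the source does not describe how the authors' automated trace checks are implemented, and `repeated_call_rate` is a hypothetical helper.

```python
from collections import Counter


def repeated_call_rate(calls):
    """Fraction of steps that exactly repeat an earlier (tool, argument) call."""
    if not calls:
        return 0.0
    seen = Counter()
    repeats = 0
    for call in calls:
        if seen[call]:
            repeats += 1
        seen[call] += 1
    return repeats / len(calls)


calls = [
    ("search", "solar capacity 2023"),
    ("query_document", "https://example.org/a"),
    ("search", "solar capacity 2023"),  # exact repeat
    ("search", "solar capacity 2023"),  # exact repeat
]
rate = repeated_call_rate(calls)
```

Exact matching misses near-duplicate queries (for example, reworded searches), so a rate computed this way is best read as a lower bound on looping behavior.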
Core Entities
Models
- o3
- Claude 3.7 Sonnet
- Claude 3.7 Sonnet (non-thinking)
- Gemini 2.5 Pro
- Gemini 2.5 Flash
- GPT-4
- GPT-4.1
- GPT-4 Turbo
- Gemma 3
- Mistral Small
- DeepSeek-R1
- DeepSeek-R1 (driver)
- ChatGPT o3
- GPT-4.5 (mentioned in product list context)
Metrics
- Recall
- Precision
- F1
- Binary 0/1 success
- Absolute difference in assigned probability
- Per-step failure rates (hallucination, repeated tool calls, forgetting)
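The set-based metrics and the probability metric listed above follow standard definitions, sketched here. Which metric applies to which task type is not stated in this summary, so the example inputs are placeholders and the function names are hypothetical.

```python
def precision_recall_f1(predicted, expected):
    """Standard set-based precision, recall, and F1 (0.0 when undefined)."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def probability_gap(assigned, reference):
    """Absolute difference between an assigned probability and a reference one."""
    return abs(assigned - reference)


# Placeholder items: the agent returned four items, two of which were expected.
precision, recall, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b"})
gap = probability_gap(0.7, 0.9)
```

In this example the agent finds everything expected (recall 1.0) but half of what it returns is wrong (precision 0.5), which F1 summarizes as roughly 0.67.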
Benchmarks
- Deep Research Bench
- AgentBench
- GAIA
- WebShop
- WebArena
Context Entities
Models
- Perplexity Pro
- Gemini Deep Research
- OpenAI Deep Research
- Grok DeepSearch
Datasets
- Common Crawl (used as fallback)
Benchmarks
- GAIA
- AgentBench

