Overview
The paper provides a realistic, repeatable benchmark and a frozen-web system; use RetroSearch to compare agents reliably, but expect limitations from a small instance set and scoring subjectivity.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 13
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 35%
Production readiness: 50%
Novelty: 40%
Why It Matters For Business
If your product uses web-research agents, use Deep Research Bench + RetroSearch to track real-world research skills over time and avoid live-web drift; current agents are useful but not yet human-level on hard research tasks.
Who Should Care
Summary TLDR
Deep Research Bench is an 89-instance benchmark for evaluating LLM-powered web research agents on realistic, messy research tasks. The authors provide RetroSearch, a frozen snapshot of scraped web pages, and tools to run ReAct-style agents. The best closed model tested (o3) reaches a mean score of 0.51 under low-elicitation prompts; the authors estimate a noise ceiling of about 0.8 for smart generalist human researchers. RetroSearch preserves relative model rankings versus live web runs, enabling repeatable, time-stable evaluations.
Problem Statement
There is no stable, repeatable benchmark that measures how well LLM agents can perform real-world web research tasks while controlling for the constantly changing web. The field needs a realistic task suite and a frozen-web environment to compare agents over time.
Main Contribution
Deep Research Bench: 89 multi-step, real-world web research task instances across 8 task types (Find Number, Find Dataset, Find Original Source, Validate Claim, Derive Number, Gather Evidence, Populate Reference Class, Compile Dataset).
RetroSearch: a frozen, queryable database of scraped web pages plus a Serper-like API to run 'retro' (offline) agent evaluations that mimic live search.
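To make the frozen-web idea concrete, here is a minimal sketch of a search backend that answers queries from a fixed set of stored pages, so the same query always returns the same results. This is not RetroSearch's actual API: the `Page` and `FrozenSearch` names, the keyword-count ranking, and the `title`/`link`/`snippet` result fields are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One scraped page in the snapshot (hypothetical schema)."""
    url: str
    title: str
    text: str


class FrozenSearch:
    """Toy stand-in for a frozen-web search backend; not the real RetroSearch."""

    def __init__(self, pages):
        # The snapshot is copied once and never refreshed from the live web.
        self._pages = list(pages)

    def search(self, query, k=3):
        """Return up to k results ranked by simple keyword counts."""
        terms = query.lower().split()
        scored = []
        for page in self._pages:
            haystack = (page.title + " " + page.text).lower()
            score = sum(haystack.count(term) for term in terms)
            if score > 0:
                scored.append((score, page.url, page))
        # Sort by score (descending), then URL, so ties resolve deterministically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [
            {"title": p.title, "link": p.url, "snippet": p.text[:80]}
            for _, _, p in scored[:k]
        ]


# Made-up pages for illustration only.
snapshot = [
    Page("https://example.org/a", "GDP data", "World GDP figures by year"),
    Page("https://example.org/b", "Cats", "Pictures of cats"),
]
engine = FrozenSearch(snapshot)
results = engine.search("gdp figures")
```

Because the snapshot never changes, two runs of the same agent months apart see identical search results, which is the property that makes scores comparable over time.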
Key Findings
Best ReAct agent mean score observed was 0.51 (o3 agent).
The authors estimate a noise ceiling of around 0.8 for smart generalist human researchers on these tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best ReAct agent mean score | 0.51 (o3) | — | — | Full 89-instance benchmark | Highest ReAct mean score reported was 0.51 for an o3 agent (Section 3.1, Fig.3) | Section 3.1 |
| Estimated human noise ceiling | ≈0.8 | — | — | Author estimate | Authors estimate a noise ceiling around 0.8 for smart generalist researchers (Section 3.1.1) | Section 3.1.1 |
What To Try In 7 Days
Run your agent on the RetroSearch snapshot for a few representative tasks to measure regressions over time.
Add a simple memory/state tracker to reduce 'forgetting' and re-run the benchmark subset for impact.
Include a toolless baseline for claim-validation tasks to check whether web access actually adds value.
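The memory/state tracker suggestion can be sketched as a small store of findings that is rendered back into the prompt at every step. This is a hypothetical design, not something the paper prescribes: the `FindingsMemory` class, its method names, and the rendered text format are assumptions.

```python
class FindingsMemory:
    """Keeps the agent's confirmed findings so they can be re-injected each step."""

    def __init__(self):
        # Maps a short key to a (finding, source) pair; later records overwrite earlier ones.
        self._notes = {}

    def record(self, key, finding, source):
        """Store or update one finding together with where it came from."""
        self._notes[key] = (finding, source)

    def render(self):
        """Return a text block to prepend to the agent's next prompt."""
        if not self._notes:
            return "Known findings so far: none"
        lines = [
            f"- {key}: {finding} (source: {source})"
            for key, (finding, source) in self._notes.items()
        ]
        return "Known findings so far:\n" + "\n".join(lines)


# Made-up example values for illustration only.
memory = FindingsMemory()
memory.record("report_year", "2021", "https://example.org/report")
prompt_prefix = memory.render()
```

Running the same benchmark subset with and without `prompt_prefix` in the prompt gives a direct before/after comparison of the tracker's effect.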
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Risks & Boundaries
Limitations
Only 89 instances: limited statistical power for fine-grained comparisons.
RetroSearch snapshots require heavy upfront crawling and can miss pages, introducing bias.
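A rough calculation shows what 89 instances implies for precision. The sketch below assumes per-instance scores lie in [0, 1] and that instances are independent; neither assumption is stated in this summary. Under those assumptions the variance of a score with mean p is at most p(1 - p), which gives a worst-case standard error for the mean.

```python
import math


def worst_case_standard_error(mean, n):
    """Upper bound on the standard error of a mean of n independent scores in [0, 1]."""
    return math.sqrt(mean * (1.0 - mean) / n)


# Reported best mean score and benchmark size from the summary above.
se = worst_case_standard_error(0.51, 89)
half_width_95 = 1.96 * se  # normal-approximation 95% interval half-width

print(f"standard error <= {se:.3f}, 95% half-width <= {half_width_95:.3f}")
```

With a mean of 0.51 this bound comes to about 0.05, so a 95% interval could be as wide as roughly plus or minus 0.10. Gaps between agents smaller than that should be read with caution, although a paired per-instance comparison may be tighter than this worst case suggests.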
When Not To Use
For tasks requiring dynamic UI interaction (clicking, form submission).
When you need large-sample statistical certainty across many topic domains.
Failure Modes
Forgetting earlier findings (state loss across the trace)
Repeated or looping tool calls
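Repeated or looping tool calls can be flagged by scanning an agent trace for identical calls. The following is a minimal sketch that assumes a trace is a list of dicts with `tool` and `args` keys; that trace format, the `find_repeated_calls` name, and the default threshold of 3 are hypothetical choices, not taken from the paper.

```python
import json
from collections import Counter


def find_repeated_calls(trace, threshold=3):
    """Return tool calls that appear at least `threshold` times with identical arguments."""
    counts = Counter(
        # Serialize args with sorted keys so equal dicts produce equal strings.
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
    )
    return [
        {"tool": tool, "args": json.loads(args), "count": n}
        for (tool, args), n in counts.items()
        if n >= threshold
    ]


# Made-up trace for illustration only.
trace = [
    {"tool": "search", "args": {"q": "us steel output 2019"}},
    {"tool": "search", "args": {"q": "us steel output 2019"}},
    {"tool": "fetch", "args": {"url": "https://example.org/steel"}},
    {"tool": "search", "args": {"q": "us steel output 2019"}},
]
loops = find_repeated_calls(trace)
```

Exact-match counting only catches literal repeats; near-duplicate queries with small wording changes would need a fuzzier comparison.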

