Deep Research Agents often break earlier content and citations when asked to revise reports

January 19, 2026 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.7

Cost Impact Score

0.25

Citation Count

0

Authors

Bingsen Chen, Boyan Li, Ping Nie, Yuyu Zhang, Xi Ye, Chen Zhao

Links

Abstract / PDF

Why It Matters For Business

If you deploy agents to draft or revise long reports, expect them to follow edits but also to unintentionally remove or weaken unrelated content and citations, so add verification and human review steps.

Summary TLDR

This paper introduces MR DRE, a benchmark and simulation pipeline that tests whether Deep Research Agents (DRAs) can revise long research reports across multiple user feedback turns. Evaluating five commercial and open DRAs on three datasets, the authors find that agents usually follow requested edits (>90% incorporation) but commonly degrade unrelated content and citations: 16–27% of previously covered content or citation quality regresses, and break rates average ~31% for content feedback and ~21% for format feedback. Multi-turn revision fails to reach the oracle upper bound (9–26% gap by turn 4). Simple inference-time fixes (structured edit prompts, a dedicated reviser agent) reduce but do not eliminate these regressions.

Problem Statement

Current DRA benchmarks treat report writing as a single-shot task, but humans iteratively revise reports. The paper asks whether DRAs can reliably revise long, cited reports across multiple user feedback turns and provides MR DRE to measure this.

Main Contribution

Define multi-turn report revision as a new evaluation axis for Deep Research Agents.

Release MR DRE: a unified 3‑dimension evaluation protocol (comprehensiveness, factuality, presentation) plus a human-verified feedback simulation pipeline.

Empirically show that five diverse DRAs frequently regress on prior content and citations during multi-turn revision, and that simple inference-time fixes are insufficient.

Key Findings

Agents follow requested edits but then break unrelated content.

Numbers: Incorporation rates mostly >90%; break rates average 31% (content) and 21% (format).

Revision causes measurable citation and factuality loss.

Numbers: Citation faithfulness and claim groundedness drop; for Sonar DR, faithfulness changed by as much as −67.4% and groundedness by −59.1% after an

Multi-turn revisions do not reach the ideal accumulation of fixes.

Numbers: By turn 4, the gap to an oracle upper bound is 9–26% (coverage).

Inference-time remedies help but don't solve core issues.

Numbers: The reviser reduced break rates (e.g., OpenAI DR from 29.6% to 10.7%) and improved coverage (+5.1), but citation drops persist.

Results

Incorporation rate

Value: mostly >90%

Baseline: initial drafts

Break rate (content feedback)

Value: ≈31% average

Break rate (format feedback)

Value: ≈21% average

Regression on previously covered content/citations

Value: 16–27% regressions reported

Citation faithfulness drop (worst case, Sonar DR)

Value: −67.4% faithfulness, −59.1% groundedness

Baseline: initial reports

Oracle gap after 4 turns (coverage)

Value: 9–26% gap to oracle

Baseline: oracle upper bound (perfect incorporation, zero break)

Effect of Reviser (example: OpenAI DR)

Value: coverage +5.1, break 10.7%

Baseline: OpenAI DR without fix (coverage −3.7, break 29.6%)
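The oracle gap above can be illustrated with a small sketch. It assumes the oracle is a report that keeps every checklist item the initial draft satisfied and also fixes every item raised in feedback (perfect incorporation, zero break), and that the gap is the difference in checklist coverage. The function name `oracle_gap`, its arguments, and the absolute-difference convention are illustrative assumptions, not the paper's published code.

```python
from __future__ import annotations

from collections.abc import Iterable


def oracle_gap(
    checklist_size: int,
    initial_satisfied: set[str],
    feedback_per_turn: Iterable[set[str]],
    actual_satisfied: set[str],
) -> float:
    """Coverage gap between an assumed oracle and the actual final draft.

    Assumption: the oracle never loses a satisfied item and fixes every
    item requested in any feedback turn.
    """
    if checklist_size <= 0:
        raise ValueError("checklist_size must be positive")

    # Oracle state: everything satisfied initially plus every requested fix.
    oracle_satisfied = set(initial_satisfied)
    for requested in feedback_per_turn:
        oracle_satisfied |= set(requested)

    oracle_coverage = len(oracle_satisfied) / checklist_size
    actual_coverage = len(actual_satisfied) / checklist_size
    return oracle_coverage - actual_coverage
```

With a 10-item checklist, 4 items satisfied initially, 3 more requested over two turns, and 5 items satisfied in the final draft, the oracle covers 7 of 10 items and the gap is roughly 0.2. Whether the paper reports an absolute or relative gap is not stated here, so treat the convention as a placeholder.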

Who Should Care

What To Try In 7 Days

Run the MR DRE Core Set on your DRA to measure incorporation and break rates.

After any automated revision, run a citation check pass and re-run checklist coverage.

Test a structured edit-plan layer (prompt engineering) or a separate reviser agent and measure break rate reductions.
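A citation check pass of the kind suggested above might start as a simple diff of citation markers between the draft before and after revision. The sketch below assumes citations appear as numeric bracket markers such as [3] or as raw URLs; the helper name `lost_citations` and both patterns are hypothetical and would need adapting to your report format. It only detects removed citations and does not measure citation faithfulness or claim groundedness.

```python
from __future__ import annotations

import re

# Assumed citation formats: numeric markers like [12] and plain http(s) URLs.
CITATION_MARKER = re.compile(r"\[(\d+)\]")
URL = re.compile(r"https?://[^\s)\]>]+")


def lost_citations(before: str, after: str) -> dict[str, set[str]]:
    """Return citation markers and URLs present before revision but not after."""
    markers_before = set(CITATION_MARKER.findall(before))
    markers_after = set(CITATION_MARKER.findall(after))
    urls_before = set(URL.findall(before))
    urls_after = set(URL.findall(after))
    return {
        "markers": markers_before - markers_after,
        "urls": urls_before - urls_after,
    }
```

A non-empty result is a signal to route the revision to human review or to re-run the revision, rather than proof that a claim is now unsupported.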

Agent Features

Memory

  • short-term context conditioning on prior drafts
  • fails to reliably preserve earlier edits

Planning

  • iterative revision loop across turns

Tool Use

  • web search APIs (Serper/Google)
  • webpage reader (Jina Reader)
  • function calling for tool access

Frameworks

  • LangChain
  • ReAct

Is Agentic

true

Architectures

  • search-augmented LLM scaffold
  • multi-LLM pipeline with specialized sub-agents

Collaboration

  • single-agent with optional reviser sub-agent

Optimization Features

Inference Optimization

  • prompt engineering (structured edit plans)
  • dedicated reviser sub-agent
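The paper's structured edit plans are a prompt-level technique. As a rough application-level analogue, the sketch below assumes the report is stored as a mapping from section heading to body text and that the agent returns edits scoped to named sections. `SectionEdit`, `apply_edit_plan`, and `out_of_scope_changes` are hypothetical names introduced for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SectionEdit:
    """One planned edit: replace the body of a named section (assumed format)."""
    heading: str
    new_body: str


def apply_edit_plan(sections: dict[str, str], plan: list[SectionEdit]) -> dict[str, str]:
    """Apply only the planned edits; every other section is copied unchanged."""
    revised = dict(sections)
    for edit in plan:
        if edit.heading not in sections:
            raise KeyError(f"edit plan targets unknown section: {edit.heading!r}")
        revised[edit.heading] = edit.new_body
    return revised


def out_of_scope_changes(
    before: dict[str, str], after: dict[str, str], plan: list[SectionEdit]
) -> list[str]:
    """List sections that changed or vanished although the plan did not target them."""
    in_scope = {edit.heading for edit in plan}
    return [
        heading
        for heading, body in before.items()
        if heading not in in_scope and after.get(heading) != body
    ]
```

Applying edits section by section means untouched sections cannot regress, while `out_of_scope_changes` can audit an agent that rewrites the whole report instead. This trades flexibility for safety: edits that legitimately span several sections must list each section in the plan.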

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Paper does not fully diagnose causes of high break rates and citation loss.
  • Does not evaluate impact of larger backbone/model scaling due to cost constraints.
  • Feedback simulation assumes high-quality checklists; sensitivity to poor checklists is not studied.
  • MR DRE does not penalize excessive report length, which affects coverage comparisons.

When Not To Use

  • If you only need one-shot short answers or paragraph summaries.
  • If your system never issues multi-turn edits or does not require citation fidelity.

Failure Modes

  • Revisions that remove previously satisfied content outside feedback scope.
  • Loss or removal of in-text citations during edits.
  • Failure to preserve earlier-turn fixes across multiple turns.

Core Entities

Models

  • OpenAI DR (o4-mini deep research)
  • Sonar DR (Perplexity)
  • LangChain Open Deep Research (LC ODR)
  • Tongyi DR
  • DR Tulu
  • Qwen3-30B-A3B-Instruct (Reviser)

Metrics

  • checklist coverage (comprehensiveness)
  • citation faithfulness
  • claim groundedness
  • presentation score
  • incorporation rate
  • break rate
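As a minimal sketch of how the last two metrics could be computed from checklist results, the code below takes incorporation rate to be the share of feedback items satisfied after revision, and break rate to be the share of previously satisfied items outside the feedback scope that are no longer satisfied. These definitions are one plausible reading of this summary; the paper's exact formulas may differ, and `TurnResult` is a hypothetical structure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TurnResult:
    """Checklist state around one revision turn (hypothetical structure)."""
    satisfied_before: frozenset[str]  # items the previous draft satisfied
    satisfied_after: frozenset[str]   # items the revised draft satisfies
    feedback_items: frozenset[str]    # items the user feedback asked to fix


def incorporation_rate(turn: TurnResult) -> float:
    """Fraction of requested items that the revised draft satisfies."""
    if not turn.feedback_items:
        return 1.0  # nothing was requested, so nothing was missed
    fixed = turn.feedback_items & turn.satisfied_after
    return len(fixed) / len(turn.feedback_items)


def break_rate(turn: TurnResult) -> float:
    """Fraction of previously satisfied, out-of-scope items lost in revision."""
    protected = turn.satisfied_before - turn.feedback_items
    if not protected:
        return 0.0  # nothing could be broken
    broken = protected - turn.satisfied_after
    return len(broken) / len(protected)
```

For example, if a draft satisfied items a, b, c, and d, the feedback requested e and f, and the revision satisfies a, b, and e, then both rates are 0.5: one of two requested items was incorporated, and two of four protected items were broken.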

Datasets

  • ResearchRubrics
  • RigorousBench
  • ResearcherBench
  • MR DRE Core Set

Benchmarks

  • MR DRE