Overview
The paper runs systematic experiments across five DRAs and three datasets, with clear metrics (coverage, faithfulness, break/incorporation) and human-validated feedback; results are robust though limited to evaluated agents and model sizes.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 25%
Production readiness: 30%
Novelty: 70%
Why It Matters For Business
If you deploy agents to draft or revise long reports, expect them to follow edits but also to unintentionally remove or weaken unrelated content and citations, so add verification and human review steps.
Who Should Care
Summary TLDR
This paper introduces MR DRE, a benchmark and simulation pipeline that tests whether Deep Research Agents (DRAs) can revise long research reports across multiple turns of user feedback. Evaluating five commercial and open DRAs on three datasets, the authors find that agents usually follow requested edits (>90% incorporation) but often degrade unrelated content and citations: 16–27% of previously covered content or citation quality regresses, and break rates average ~31% for content feedback and ~21% for format feedback. Multi-turn revision does not reach the oracle upper bound (9–26% gap by turn 4). Simple runtime fixes (structured edit prompts, a dedicated reviser agent) reduce but do not eliminate…
Problem Statement
Current DRA benchmarks treat report writing as a single-shot task, but humans iteratively revise reports. The paper asks whether DRAs can reliably revise long, cited reports across multiple user feedback turns and provides MR DRE to measure this.
Main Contribution
Define multi-turn report revision as a new evaluation axis for Deep Research Agents.
Release MR DRE: a unified 3‑dimension evaluation protocol (comprehensiveness, factuality, presentation) plus a human-verified feedback simulation pipeline.
Key Findings
Agents follow requested edits but then break unrelated content.
Revision causes measurable citation and factuality loss.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Incorporation rate | mostly >90% | initial drafts | — | averaged across agents/datasets | §5.1: 'incorporation rates mostly exceed 90%' | Table 2 |
| Break rate (content feedback) | ≈31% average | — | — | averaged across agents/datasets | §5.1: 'Break rates average 31% under content feedback' | Table 2 |
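
The two rates in the table can be illustrated with a minimal sketch. It assumes a checklist-based reading of the metrics: incorporation is the share of feedback-targeted checklist items satisfied after the revision, and break rate is the share of previously satisfied, untargeted items that are no longer satisfied. The paper's exact definitions are not given in this summary, and the `RevisionTurn` structure and its field names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RevisionTurn:
    """One feedback turn (hypothetical structure, not the paper's schema)."""
    before: dict[str, bool]   # checklist item id -> satisfied before the revision
    after: dict[str, bool]    # checklist item id -> satisfied after the revision
    targeted: set[str]        # item ids the user's feedback asked to change


def incorporation_rate(turn: RevisionTurn) -> float | None:
    """Share of feedback-targeted items satisfied after the revision (assumed definition)."""
    if not turn.targeted:
        return None
    satisfied = sum(turn.after.get(item, False) for item in turn.targeted)
    return satisfied / len(turn.targeted)


def break_rate(turn: RevisionTurn) -> float | None:
    """Share of previously satisfied, untargeted items lost in the revision (assumed definition)."""
    previously_ok = [
        item for item, ok in turn.before.items()
        if ok and item not in turn.targeted
    ]
    if not previously_ok:
        return None
    broken = sum(not turn.after.get(item, False) for item in previously_ok)
    return broken / len(previously_ok)


turn = RevisionTurn(
    before={"a": True, "b": True, "c": False, "d": True},
    after={"a": True, "b": False, "c": True, "d": True},
    targeted={"c"},
)
print(incorporation_rate(turn), break_rate(turn))
```

In this toy turn the requested item `c` is incorporated, while the unrelated item `b` regresses, which is the pattern the findings describe.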
What To Try In 7 Days
Run MR DRE Core Set on your DRA to measure incorporation/break rates.
After any automated revision, run a citation check pass and re-run checklist coverage.
Test a structured edit-plan layer (prompt engineering) or a separate reviser agent and measure break rate reductions.
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Inference Optimization
Risks & Boundaries
Limitations
Paper does not fully diagnose causes of high break rates and citation loss.
Does not evaluate impact of larger backbone/model scaling due to cost constraints.
When Not To Use
If you only need one-shot short answers or paragraph summaries.
If your system never issues multi-turn edits or does not require citation fidelity.
Failure Modes
Revisions that remove previously satisfied content outside feedback scope.
Loss or removal of in-text citations during edits.
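
A citation check pass for the second failure mode might look like the sketch below. It assumes citations appear as numeric markers such as `[3]` or as bare URLs; real reports may use other styles, and the function names here are illustrative rather than part of MR DRE.

```python
import re

# Assumed citation styles: numeric markers like [12] and bare http(s) URLs.
CITATION_PATTERN = re.compile(r"\[\d+\]|https?://[^\s)\]]+")


def extract_citations(text: str) -> set[str]:
    """Collect citation markers and URLs, trimming trailing punctuation from URLs."""
    return {match.group(0).rstrip(".,;") for match in CITATION_PATTERN.finditer(text)}


def lost_citations(draft: str, revision: str) -> set[str]:
    """Citations present in the draft but missing from the revision."""
    return extract_citations(draft) - extract_citations(revision)


draft = "Claim A [1]. Claim B [2] (see https://example.org/report)."
revision = "Claim A [1]. Claim B."
print(sorted(lost_citations(draft, revision)))
```

A non-empty result flags a revision for human review. It does not show whether the removal was intended by the feedback, so it works best as a trigger for inspection, not as an automatic rejection rule.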

