Overview
Production Readiness
0.3
Novelty Score
0.7
Cost Impact Score
0.25
Citation Count
0
Why It Matters For Business
If you deploy agents to draft or revise long reports, expect them to apply the requested edits but also to unintentionally remove or weaken unrelated content and citations. Add verification and human review steps after each revision.
Summary TLDR
This paper introduces MR DRE, a benchmark and simulation pipeline that tests whether Deep Research Agents (DRAs) can revise long research reports across multiple user feedback turns. Evaluating five commercial and open DRAs on three datasets, the authors find that agents usually follow requested edits (>90% incorporation) but commonly degrade unrelated content and citations: 16–27% of previously covered content or citation quality regresses, and break rates average ~31% for content feedback and ~21% for format feedback. Multi-turn revision fails to reach the oracle upper bound (9–26% gap by turn 4). Simple runtime fixes (structured edit prompts, a dedicated reviser agent) reduce but do not eliminate these regressions.
Problem Statement
Current DRA benchmarks treat report writing as a single-shot task, but humans iteratively revise reports. The paper asks whether DRAs can reliably revise long, cited reports across multiple user feedback turns and provides MR DRE to measure this.
Main Contribution
Define multi-turn report revision as a new evaluation axis for Deep Research Agents.
Release MR DRE: a unified 3‑dimension evaluation protocol (comprehensiveness, factuality, presentation) plus a human-verified feedback simulation pipeline.
Empirically show that five diverse DRAs frequently regress on prior content and citations during multi-turn revision, and that simple inference-time fixes are insufficient.
Key Findings
Agents follow requested edits but then break unrelated content.
Revision causes measurable citation and factuality loss.
Multi-turn revisions do not reach the ideal accumulation of fixes.
Inference-time remedies help but don't solve core issues.
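The gap to the oracle upper bound in the third finding can be made concrete with a small sketch. It assumes a definition that is not taken from the paper: the oracle at turn t covers everything in the initial draft plus every checklist item requested in feedback through turn t, and coverage is the fraction of checklist items satisfied. The function and parameter names are hypothetical.

```python
def coverage(satisfied: set[str], checklist: set[str]) -> float:
    """Fraction of checklist items that a report satisfies."""
    return len(satisfied & checklist) / len(checklist)


def oracle_gap(
    checklist: set[str],
    initial: set[str],
    feedback_by_turn: list[set[str]],
    actual_by_turn: list[set[str]],
) -> list[float]:
    """Per-turn difference between oracle coverage and actual coverage.

    Assumption: the oracle applies every requested fix and never loses
    an item that was already satisfied.
    """
    oracle = set(initial)
    gaps = []
    for requested, actual in zip(feedback_by_turn, actual_by_turn):
        oracle |= requested  # ideal accumulation of fixes
        gaps.append(coverage(oracle, checklist) - coverage(actual, checklist))
    return gaps
```

A gap that stays above zero after the final turn means the agent either missed a requested fix or lost something it had already covered.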
Results
Incorporation rate
Break rate (content feedback)
Break rate (format feedback)
Regression on previously covered content/citations
Citation faithfulness drop (worst-case Sonar DR)
Oracle gap after 4 turns (coverage)
Effect of Reviser (example OpenAI DR)
Who Should Care
What To Try In 7 Days
Run MR DRE Core Set on your DRA to measure incorporation/break rates.
After any automated revision, run a citation check pass and re-run checklist coverage.
Test a structured edit-plan layer (prompt engineering) or a separate reviser agent and measure break rate reductions.
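To track the two headline numbers while trying the steps above, a minimal sketch follows. The definitions are assumptions and may differ from the paper's: incorporation rate is the share of requested checklist items satisfied after revision, and break rate is the share of revision turns in which at least one previously satisfied item outside the feedback scope is lost. `Turn` and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    requested: set[str]  # checklist items the feedback asked for
    before: set[str]     # items satisfied before the revision
    after: set[str]      # items satisfied after the revision


def incorporation_rate(turns: list[Turn]) -> float:
    """Share of requested items that are satisfied after revision."""
    requested = sum(len(t.requested) for t in turns)
    fixed = sum(len(t.requested & t.after) for t in turns)
    return fixed / requested if requested else 0.0


def break_rate(turns: list[Turn]) -> float:
    """Share of turns that lose a previously satisfied, out-of-scope item."""
    broken = sum(1 for t in turns if (t.before - t.requested) - t.after)
    return broken / len(turns) if turns else 0.0
```

Comparing these two values before and after adding an edit-plan layer or a reviser agent gives a direct measure of whether the change helps.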
Agent Features
Memory
- short-term context conditioning on prior drafts
- fails to reliably preserve earlier edits
Planning
- iterative revision loop across turns
Tool Use
- web search APIs (Serper/Google)
- webpage reader (Jina Reader)
- function calling for tool access
Frameworks
- LangChain
- ReAct
Is Agentic
true
Architectures
- search-augmented LLM scaffold
- multi-LLM pipeline with specialized sub-agents
Collaboration
- single-agent with optional reviser sub-agent
Optimization Features
Inference Optimization
- prompt engineering (structured edit plans)
- dedicated reviser sub-agent
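A structured edit prompt of the kind listed above might be assembled as follows. The paper's actual prompt is not shown in this summary, so the wording, the `protected` list, and the function name are all hypothetical.

```python
def build_edit_prompt(report: str, feedback: str, protected: list[str]) -> str:
    """Assemble a revision prompt that asks for an edit plan before editing."""
    lines = [
        "Revise the report according to the feedback.",
        "First list the sections you will change, then edit only those sections.",
        "Keep every other section and every in-text citation unchanged.",
        "Content that must survive the revision:",
    ]
    lines += [f"- {item}" for item in protected]
    lines += ["", "Feedback:", feedback, "", "Report:", report]
    return "\n".join(lines)
```

Listing the protected content explicitly is one way to keep the model's edits scoped to the feedback. Whether it reduces break rates for a given agent has to be measured.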
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Paper does not fully diagnose causes of high break rates and citation loss.
- Does not evaluate impact of larger backbone/model scaling due to cost constraints.
- Feedback simulation assumes high-quality checklists; sensitivity to poor checklists is not studied.
- MR DRE does not penalize excessive report length, which affects coverage comparisons.
When Not To Use
- If you only need one-shot short answers or paragraph summaries.
- If your system never issues multi-turn edits or does not require citation fidelity.
Failure Modes
- Revisions that remove previously satisfied content outside feedback scope.
- Loss or removal of in-text citations during edits.
- Failure to preserve earlier-turn fixes across multiple turns.
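The citation-loss failure mode above can be caught with a simple diff between drafts. This sketch assumes citations appear as bracketed numbers such as `[3]`, which may not match your report format. It only detects markers that vanish entirely, not citations that remain but no longer support the claim.

```python
import re

# Assumed citation format: bracketed integers like [12].
CITATION = re.compile(r"\[(\d+)\]")


def lost_citations(old_draft: str, new_draft: str) -> set[str]:
    """Citation markers present in the old draft but absent from the new one."""
    return set(CITATION.findall(old_draft)) - set(CITATION.findall(new_draft))
```

A non-empty result after a revision is a signal to route the draft to human review or to re-run the citation faithfulness check.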
Core Entities
Models
- OpenAI DR (o4-mini deep research)
- Sonar DR (Perplexity)
- LangChain Open Deep Research (LC ODR)
- Tongyi DR
- DR Tulu
- Qwen3-30B-A3B-Instruct (Reviser)
Metrics
- checklist coverage (comprehensiveness)
- citation faithfulness
- claim groundedness
- presentation score
- incorporation rate
- break rate
Datasets
- ResearchRubrics
- RigorousBench
- ResearcherBench
- MR DRE Core Set
Benchmarks
- MR DRE

