Overview
The paper includes benchmark results and ablations that support the design claims, but the sub-agent ablations are mainly limited to GAIA, and latency, token, and compute costs are not measured.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Modular orchestration and structured memory reduce failed web/data tasks and improve first-run correctness, lowering human oversight and accelerating research pipelines.
Who Should Care
Summary TLDR
Yunque DeepResearch is a hierarchical, modular multi-agent framework for long-horizon research tasks. It groups interactions into sub-goal memory units, routes subtasks to an atomic pool of tools and sub-agents (browser GUI, data analysis), and adds a Supervisor that interrupts and prunes failing trajectories. On four agentic benchmarks it reports Pass@1 of 62.5 on BrowseComp, 75.9 on BrowseComp-ZH, 78.6 on GAIA, and 51.7 on HLE, and it shows model-agnostic gains (e.g., +10.0 on BrowseComp for Gemini 3 Pro). The authors state that the code is open source, and the paper includes implementation details and ablations.
Problem Statement
Current deep-research agents struggle with three practical problems: 1) context noise and information overload in long-horizon tasks; 2) fragile execution that causes cascading errors and loops; 3) rigid architectures that block integrating specialized tools and sub-agents.
Main Contribution
A hierarchical multi-agent architecture that decouples planning from execution via a Main Agent, Context Manager, Atomic Capability Pool, and Supervisor.
Sub-goal-driven structured memory that folds multi-round traces into concise memory units and switches context between detailed traces and compressed summaries.
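The sub-goal memory idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `MemoryUnit` class, the `fold` method, and the `build_context` function are hypothetical names chosen for illustration, and the summary text is a placeholder that an LLM would normally produce. The sketch shows one way to keep the detailed trace only for the active sub-goal while completed sub-goals are rendered as compressed summaries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryUnit:
    """Hypothetical memory unit: one sub-goal, its raw trace, and an optional summary."""
    sub_goal: str
    trace: List[str] = field(default_factory=list)
    summary: Optional[str] = None  # set once the sub-goal is completed

    def fold(self, summary: str) -> None:
        """Mark the sub-goal as done by attaching a compressed summary."""
        self.summary = summary


def build_context(units: List[MemoryUnit]) -> str:
    """Render folded units as one-line summaries and unfolded units as full traces."""
    parts = []
    for unit in units:
        if unit.summary is not None:
            parts.append(f"[done] {unit.sub_goal}: {unit.summary}")
        else:
            parts.append(f"[active] {unit.sub_goal}")
            parts.extend(f"  {step}" for step in unit.trace)
    return "\n".join(parts)


units = [
    MemoryUnit("locate source page", ["search: topic", "open page 1"]),
    MemoryUnit("extract figure", ["scroll to table"]),
]
# Placeholder summary; in practice a model would write this.
units[0].fold("source page found (placeholder summary)")
context = build_context(units)
print(context)
```

Folding keeps the prompt short as the number of completed sub-goals grows, at the cost of losing step-level detail unless the raw trace is stored elsewhere for later inspection.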
Key Findings
Yunque achieves the top Pass@1 on three of the four benchmarks (all except GAIA).
Yunque is second on GAIA and improves base model performance.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (BrowseComp) | 62.5 | Best prior reported ~60.2 (Kimi K2 Thinking; other baselines vary) | about +10.0 vs Gemini 3 Pro ReAct baseline | BrowseComp | Main Results Table 1 | Table 1 |
| Pass@1 (BrowseComp-ZH) | 75.9 | Prior baselines reported mid-60s | +8.8 vs Gemini 3 Pro ReAct (per Table 2) | BrowseComp-ZH | Main Results Table 1 | Table 1 |
What To Try In 7 Days
Add sub-goal memory: fold completed subtasks into compact summaries to cut context noise.
Introduce a lightweight Supervisor to detect and prune failed traces before retries.
Prototype a Browser + Data Analysis sub-agent pair with a one-tool-per-turn constraint.
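The Supervisor and one-tool-per-turn suggestions above can be prototyped in a few lines. The following is a rough sketch under stated assumptions, not the paper's method: `enforce_one_tool`, `Supervisor`, `prune_failed`, and the `max_repeats` threshold are hypothetical, and "failure" is reduced to a boolean flag per step.

```python
from collections import Counter
from typing import Dict, List, Tuple


def enforce_one_tool(tool_calls: List[Dict]) -> Dict:
    """Reject a turn that does not contain exactly one tool call."""
    if len(tool_calls) != 1:
        raise ValueError(f"expected exactly 1 tool call, got {len(tool_calls)}")
    return tool_calls[0]


class Supervisor:
    """Hypothetical supervisor: interrupts when one action fails repeatedly."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats  # assumed threshold, tune per task
        self.failures = Counter()

    def observe(self, action: str, ok: bool) -> bool:
        """Record one step; return True if the trajectory should be interrupted."""
        if ok:
            return False
        self.failures[action] += 1
        return self.failures[action] >= self.max_repeats


def prune_failed(trace: List[Tuple[str, bool]]) -> List[Tuple[str, bool]]:
    """Keep only successful steps so failed ones are not replayed on retry."""
    return [step for step in trace if step[1]]


supervisor = Supervisor(max_repeats=2)
trace = [("search: q", True), ("click: #broken", False), ("click: #broken", False)]
interrupts = [supervisor.observe(action, ok) for action, ok in trace]
clean_trace = prune_failed(trace)
print(interrupts, clean_trace)
```

A counter over identical failing actions is the simplest loop detector; a real system would likely also need to catch near-duplicate actions and failures that only show up in the tool's output.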
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Reproducibility
Risks & Boundaries
Limitations
Ablations on specialized sub-agents are mainly limited to GAIA and may not generalize to all domains.
No systematic measurement of token consumption, latency, or compute cost reported.
When Not To Use
Latency-sensitive production services where interactive browsing would add unacceptable delay.
Scenarios where the underlying LLM cannot reliably perform precise tool invocation.
Failure Modes
Cascading errors from malformed tool calls or repeated bad actions.
Context pollution where failed traces bias future reasoning.
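One common guard against the malformed-tool-call failure mode is to validate each call before executing it. The sketch below is illustrative only: the `TOOL_SCHEMAS` registry, the tool names, and the `{"name": ..., "arguments": ...}` call shape are assumptions, not the framework's actual interface.

```python
import json
from typing import Tuple

# Hypothetical tool registry: tool name -> required argument names.
TOOL_SCHEMAS = {
    "search": {"query"},
    "open_url": {"url"},
}


def validate_tool_call(raw: str) -> Tuple[bool, str]:
    """Return (is_valid, reason) for a JSON-encoded tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return False, f"unknown tool: {name!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    missing = TOOL_SCHEMAS[name] - set(args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"


print(validate_tool_call('{"name": "search", "arguments": {"query": "x"}}'))
print(validate_tool_call('{"name": "search", "arguments": {}}'))
```

Returning a reason string rather than raising lets the caller feed the error back to the model as a correction hint instead of executing a bad action.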

