Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Modular orchestration and structured memory reduce failed web/data tasks and improve first-run correctness, lowering human oversight and accelerating research pipelines.
Summary TLDR
Yunque DeepResearch is a hierarchical, modular multi-agent framework for long-horizon research tasks. It groups interactions into sub-goal memory units, routes subtasks to an atomic pool of tools and sub-agents (browser GUI, data-analysis), and adds a Supervisor that interrupts and prunes failing trajectories. On four agentic benchmarks it reports Pass@1: BrowseComp 62.5, BrowseComp-ZH 75.9, GAIA 78.6, HLE 51.7 and shows model-agnostic gains (e.g., +10.0 on BrowseComp for Gemini 3 Pro). The authors state that the code is open-source, and the paper includes implementation details and ablations.
Problem Statement
Current deep-research agents struggle with three practical problems: 1) context noise and information overload in long-horizon tasks; 2) fragile execution that causes cascading errors and loops; 3) rigid architectures that block integrating specialized tools and sub-agents.
Main Contribution
A hierarchical multi-agent architecture that decouples planning from execution via a Main Agent, Context Manager, Atomic Capability Pool, and Supervisor.
Sub-goal-driven structured memory that folds multi-round traces into concise memory units and switches context between detailed traces and compressed summaries.
Atomic Capability Pool with specialized sub-agents (Browser-Use GUI, Data Analysis) and a strict one-tool-per-turn interface.
A Supervisor module that detects anomalies, prunes invalid traces, and forces adaptive interrupts and recovery.
Empirical evaluation showing strong gains on BrowseComp, BrowseComp-ZH, GAIA, and Humanity's Last Exam; open-source release promised.
Key Findings
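Yunque reports the top Pass@1 on three of the four benchmarks: BrowseComp (62.5), BrowseComp-ZH (75.9), and Humanity's Last Exam (51.7).
Yunque ranks second on GAIA (78.6) and improves base model performance (e.g., +10.0 on BrowseComp for Gemini 3 Pro).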
Structured memory strongly affects browsing results.
Supervisor reduces execution fragility.
Results
Pass@1 (BrowseComp): 62.5
Pass@1 (BrowseComp-ZH): 75.9
Pass@1 (GAIA): 78.6
Pass@1 (Humanity's Last Exam): 51.7
Who Should Care
What To Try In 7 Days
Add sub-goal memory: fold completed subtasks into compact summaries to cut context noise.
Introduce a lightweight Supervisor to detect and prune failed traces before retries.
Prototype a Browser + Data Analysis sub-agent pair with a one-tool-per-turn constraint.
Agent Features
Memory
- sub-goal-driven structured memory units (4-tuple)
- dynamic folding and compression of completed sub-goals
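A minimal sketch of sub-goal memory folding, under stated assumptions: the source says each memory unit is a 4-tuple but does not name its fields, so `sub_goal`, `trace`, `summary`, and `done` are hypothetical. Completed units contribute only their summary to the context, while the active unit keeps its detailed trace.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryUnit:
    # Hypothetical 4-tuple fields; the source does not specify them.
    sub_goal: str
    trace: list[str] = field(default_factory=list)  # detailed interaction rounds
    summary: str = ""                                # compressed result
    done: bool = False                               # set once the sub-goal is folded

    def fold(self, summary: str) -> None:
        """Mark the sub-goal complete; from now on only the summary is rendered."""
        self.summary = summary
        self.done = True

    def render(self) -> str:
        """Folded units render as one summary line, active units as the full trace."""
        if self.done:
            return f"[{self.sub_goal}] {self.summary}"
        return f"[{self.sub_goal}]\n" + "\n".join(self.trace)


def build_context(units: list[MemoryUnit]) -> str:
    """Assemble the prompt context from all memory units, in order."""
    return "\n".join(unit.render() for unit in units)
```

The trace is kept on the folded unit rather than deleted, so a caller could switch back to the detailed view if a later sub-goal needs it.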
Planning
- dynamic task decomposition into sub-goals
- adaptive routing to tools or specialized sub-agents
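Routing a sub-goal to a tool or sub-agent can be sketched as a lookup in a capability table. This is an illustration only: the handler functions, the capability names, and the fallback behaviour are all assumptions, not the framework's actual interface.

```python
from typing import Callable

Handler = Callable[[str], str]


def make_router(pool: dict[str, Handler], default: str) -> Callable[[str, str], str]:
    """Return a function that dispatches a sub-goal to the named capability."""
    if default not in pool:
        raise KeyError(f"default capability {default!r} is not in the pool")

    def route(capability: str, sub_goal: str) -> str:
        # Unknown capability names fall back to the default handler.
        handler = pool.get(capability, pool[default])
        return handler(sub_goal)

    return route


# Hypothetical stand-ins for real tools and sub-agents.
def browser_agent(sub_goal: str) -> str:
    return f"browser:{sub_goal}"


def data_agent(sub_goal: str) -> str:
    return f"data:{sub_goal}"


def web_search(sub_goal: str) -> str:
    return f"search:{sub_goal}"


POOL = {"browser": browser_agent, "data_analysis": data_agent, "search": web_search}
route = make_router(POOL, default="search")
```

Adding a new capability is then a matter of adding one entry to the table, which is the kind of composability the capability pool is described as offering.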
Tool Use
- Browser-Use GUI Agent (interactive web actions)
- Data Analysis Agent (code generation + execution)
- one-tool-per-turn execution constraint
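One way to enforce a one-tool-per-turn constraint is to validate the model's proposed calls before anything executes. The sketch below assumes a hypothetical `ToolCall` shape and error type; the source does not describe how the constraint is implemented.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical representation of a parsed tool invocation.
    name: str
    arguments: dict


class TurnError(ValueError):
    """Raised when a turn violates the one-tool-per-turn constraint."""


def select_single_call(calls: list[ToolCall], known_tools: set[str]) -> ToolCall:
    """Accept a turn only if it contains exactly one call to a known tool."""
    if len(calls) != 1:
        raise TurnError(f"expected exactly one tool call, got {len(calls)}")
    call = calls[0]
    if call.name not in known_tools:
        raise TurnError(f"unknown tool: {call.name}")
    return call
```

Rejecting the whole turn, rather than silently running the first call, keeps a malformed turn out of the trace and gives the caller a clear signal to re-prompt.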
Frameworks
- open-source implementation (claimed)
- composable Atomic Capability Pool for plugging new tools
Is Agentic
true
Architectures
- hierarchical multi-agent (Main Agent + sub-agents)
- centralized orchestration with atomic capability pool
Collaboration
- Main Agent delegates to specialized sub-agents
- Supervisor intervenes with interrupts and recovery
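A lightweight Supervisor check might look like the sketch below. It is a simplification under assumptions: the `Step` record, the "same failing action repeated" heuristic, and the `max_repeats` threshold are illustrative choices, not the anomaly detection described in the paper.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Hypothetical record of one executed action and whether it succeeded.
    action: str
    ok: bool


def should_interrupt(trace: list[Step], max_repeats: int = 3) -> bool:
    """True when the last `max_repeats` steps are the same failing action."""
    if max_repeats < 1 or len(trace) < max_repeats:
        return False
    tail = trace[-max_repeats:]
    return all(not s.ok and s.action == tail[0].action for s in tail)


def prune_failed(trace: list[Step]) -> list[Step]:
    """Drop failed steps so they do not bias later reasoning."""
    return [s for s in trace if s.ok]
```

After an interrupt, the pruned trace can be handed back to the planner so that a retry starts from the successful steps only.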
Optimization Features
Token Efficiency
- PDF paging and incremental markdown conversion to avoid injecting full long docs
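Paging can be sketched independently of any PDF library: assume the document has already been converted to a list of per-page text strings, and return a small window plus a pointer to the next one. The function name, the window size, and the returned fields are hypothetical.

```python
from typing import Optional, TypedDict


class PageWindow(TypedDict):
    text: str
    next_start: Optional[int]
    total_pages: int


def get_pages(pages: list[str], start: int, count: int = 2) -> PageWindow:
    """Return `count` pages beginning at `start`, plus where to resume."""
    if not 0 <= start < len(pages):
        raise IndexError(f"start {start} is outside 0..{len(pages) - 1}")
    end = min(start + count, len(pages))
    return {
        "text": "\n\n".join(pages[start:end]),
        # None signals that the document has been fully read.
        "next_start": end if end < len(pages) else None,
        "total_pages": len(pages),
    }
```

The agent requests the next window only if the current one did not answer the sub-goal, so a long document never enters the context in full.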
System Optimization
- one-tool-per-turn policy to decompose complex actions
- context complexity reduction from O(t) to O(n) via sub-goal folding
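The O(t) to O(n) claim can be illustrated with a small count. The summary does not define the symbols, so this sketch assumes t is the total number of interaction turns and n is the number of sub-goals, and it counts one context entry per raw turn or per folded summary.

```python
def context_entries(turns_per_sub_goal: list[int], fold: bool) -> int:
    """Count context entries at the end of the final sub-goal.

    Without folding, every turn stays in context (grows with t).
    With folding, each completed sub-goal is one summary entry and
    only the active sub-goal keeps its raw turns (grows with n).
    """
    if not turns_per_sub_goal:
        return 0
    if not fold:
        return sum(turns_per_sub_goal)
    completed = len(turns_per_sub_goal) - 1
    return completed + turns_per_sub_goal[-1]


# Five sub-goals of ten turns each.
unfolded = context_entries([10] * 5, fold=False)  # 50 entries
folded = context_entries([10] * 5, fold=True)     # 4 summaries + 10 active turns = 14
```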
Reproducibility
Code Available
Open Source Status
- yes (claimed; no link or license is given in the text)
Risks & Boundaries
Limitations
- Ablations on specialized sub-agents are mainly limited to GAIA and may not generalize to all domains.
- No systematic measurement of token consumption, latency, or compute cost reported.
- Claims of open-source code are made but links and licenses are not provided in the text.
When Not To Use
- Latency-sensitive production services where interactive browsing would add unacceptable delay.
- Scenarios where the underlying LLM cannot reliably perform precise tool invocation.
- Domain-specific tasks not covered by the evaluated benchmarks without further tuning.
Failure Modes
- Cascading errors from malformed tool calls or repeated bad actions.
- Context pollution where failed traces bias future reasoning.
- Dependency on the backbone model's ability to execute tool calls accurately.
Core Entities
Models
- Gemini 3 Pro
- DeepSeek-V3.2
- Kimi K2 Thinking
- GLM 4.7
- GPT-5 High
- Claude-4.5-Sonnet
- OpenAI-o3
Metrics
- Pass@1
- Pass@N
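For reference, the two metrics can be computed as below. This uses the common reading of the terms (Pass@1 scores the first attempt per task, Pass@N counts a task as solved if any of its N attempts is correct); the summary does not state the paper's exact protocol, so treat the definitions as assumptions.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Percentage of tasks whose first attempt is correct."""
    if not results:
        raise ValueError("results must contain at least one task")
    return 100.0 * sum(attempts[0] for attempts in results) / len(results)


def pass_at_n(results: list[list[bool]]) -> float:
    """Percentage of tasks with at least one correct attempt."""
    if not results:
        raise ValueError("results must contain at least one task")
    return 100.0 * sum(any(attempts) for attempts in results) / len(results)


# Four tasks, two attempts each (made-up outcomes for illustration).
example = [[True, False], [False, True], [False, False], [True, True]]
```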
Datasets
- GAIA
- BrowseComp
- BrowseComp-ZH
- Humanity's Last Exam
Benchmarks
- GAIA
- BrowseComp
- BrowseComp-ZH
- Humanity's Last Exam

