Train the model to plan first, then search: RL for planning + SFT for multi-round retrieval boosts multi-hop QA

Overview

Decision Snapshot: Ready For Pilot

Results show consistent EM increases across four benchmarks and ablations confirm both RL planning and SFT matter. Experiments use public multi-hop datasets but rely on heavy compute and synthetic SFT data, limiting immediate plug-and-play adoption.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.70

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

Links

Abstract / PDF

Why It Matters For Business

APEX-Searcher raises strict exact-answer accuracy on multi-hop, Wikipedia-style QA. That reduces human verification work and improves automation for information-seeking tasks that need precise facts.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

APEX-Searcher separates strategic planning from execution. First, an LLM is trained with RL to decompose multi-hop questions into an ordered plan. Second, the LLM is fine-tuned (SFT) to execute that plan with multi-round retrieval. Combined, these steps raise Exact Match (EM) on several multi-hop QA benchmarks by substantial margins versus standard and agentic RAG baselines.

Problem Statement

Single-pass RAG and naive iterative retrieval struggle on multi-hop questions because they lack a global search plan. More retrieval steps alone often add noise, repeat queries, or miss dependencies between subtasks.

Main Contribution

A two-stage agent architecture: RL-based planning agent that decomposes questions, plus an execution agent trained with supervised fine-tuning to perform multi-round retrieval.

A training recipe: Group Relative Policy Optimization (GRPO) for plan quality, plus a 14.6k-instance multi-turn SFT dataset for exploration.
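As a rough illustration of the group-relative idea behind GRPO, the sketch below scores each sampled plan against the other plans drawn for the same question. The function name `group_relative_advantages` and the reward values are hypothetical; this summary does not describe the paper's actual reward function or normalization details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Plans that beat the group average get a positive advantage,
    plans below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four plans sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed to judge whether a plan was better or worse than its siblings.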

Key Findings

APEX-Searcher outperforms standard RAG and prior agentic RAGs on multi-hop QA EM.

Numbers: 7B avg EM 0.376 vs Standard RAG 0.200 (Table 1)

Practical Use: If you need higher exact-answer accuracy on multi-hop QA over a Wikipedia-style corpus, adopt a planning+execution pipeline like APEX-Searcher.

Evidence Ref: Table 1; Section 4.3

Ablation: combining planning (with RL) and SFT yields the largest gains.

Numbers: 7B Eval 27.55 -> 37.64 (≈+36.6% relative); 3B Eval 13.42 -> 33.45 (≈+149% relative)

Practical Use: Both steps matter: train a planner with RL and fine-tune execution. Skipping either reduces or reverses gains, especially on small models.

Evidence Ref: Table 2; Section 4.4
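The relative gains quoted for the ablation can be checked with a few lines of arithmetic. The helper name `relative_gain` is just for illustration; the four scores are the ones reported above.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

gain_7b = relative_gain(27.55, 37.64)  # 7B ablation scores from Table 2
gain_3b = relative_gain(13.42, 33.45)  # 3B ablation scores from Table 2
print(f"7B: +{gain_7b:.1f}%  3B: +{gain_3b:.1f}%")
```

The 3B model starts from a much lower score, which is why its relative gain is so much larger even though the absolute improvements (about 10 and 20 points) differ by only a factor of two.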

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match (EM) per benchmark - APEX-Searcher (7B)	HotpotQA 0.402; 2Wiki 0.540; MuSiQue 0.164; Bamboogle 0.400; Avg 0.376	Standard RAG (7B) Avg 0.200	Avg +0.176 (7B APEX-Searcher vs Standard RAG)	Table 1	Table 1 reports per-benchmark EM and averages	Table 1
Exact Match (EM) per benchmark - APEX-Searcher (3B)	HotpotQA 0.356; 2Wiki 0.494; MuSiQue 0.136; Bamboogle 0.352; Avg 0.335	Standard RAG (3B) Avg 0.152	Avg +0.183 (3B APEX-Searcher vs Standard RAG)	Table 1	Table 1 reports per-benchmark EM and averages	Table 1

What To Try In 7 Days

Run a small pilot: add a planning prompt that decomposes complex queries into 2–4 subquestions.

Fine-tune your generation model on a few thousand multi-turn retrieval examples (generate them with self-instruct, then validate them before training).

Tune two hyperparameters: number of retrieved docs (start at 3) and max hops (start at 3–5). Measure Exact Match on held-out multi-hop queries.
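To measure Exact Match in such a pilot, a minimal scorer might look like the sketch below. It assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the paper's exact scoring script is not described in this summary, and the function names are illustrative.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

def mean_em(predictions, references):
    """Average EM over a held-out set; `references` holds a list of golds per item."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)
```

EM is strict: a prediction such as "Paris, France" scores zero against the gold answer "Paris", so it is worth prompting the model for short answers before comparing configurations.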

Agent Features

Memory

accumulated knowledge base (K_acc) - short-term QA pairs for context
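One simple way to picture such a memory is an append-only list of sub-question/answer pairs that is rendered into the prompt for later steps. The class below is a hypothetical sketch; the name `AccumulatedKnowledge` and the `Q1:/A1:` rendering are assumptions, not the paper's format.

```python
from dataclasses import dataclass, field

@dataclass
class AccumulatedKnowledge:
    """Short-term store of answered sub-questions (a stand-in for K_acc)."""
    pairs: list = field(default_factory=list)

    def add(self, sub_question, answer):
        self.pairs.append((sub_question, answer))

    def as_context(self):
        # Render the pairs as numbered Q/A lines for the next prompt.
        return "\n".join(
            f"Q{i}: {q}\nA{i}: {a}"
            for i, (q, a) in enumerate(self.pairs, start=1)
        )

k_acc = AccumulatedKnowledge()
k_acc.add("Who directed Inception?", "Christopher Nolan")
context = k_acc.as_context()
```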

Planning

RL-based task decomposition

GRPO

structured output with #n reference placeholders
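The `#n` placeholders let a later step refer to the answer of an earlier one. A minimal sketch of resolving them, assuming steps are numbered from 1 and answers are kept in a dictionary (both assumptions; the exact plan format is not given here):

```python
import re

def resolve_placeholders(step, answers):
    """Replace each #n in a plan step with the answer to step n, if known."""
    def substitute(match):
        index = int(match.group(1))
        # Leave the placeholder untouched when step n has no answer yet.
        return answers.get(index, match.group(0))
    return re.sub(r"#(\d+)", substitute, step)

# Hypothetical two-step plan for "Where was the director of Inception born?"
plan = ["Who directed the film Inception?", "In which city was #1 born?"]
answers = {1: "Christopher Nolan"}
resolved = resolve_placeholders(plan[1], answers)
```

Leaving unresolved placeholders in place, rather than raising, makes it easy to detect a step whose dependencies have not been answered yet.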

Tool Use

search index / retriever

document de-duplication

dynamic query generation

Frameworks

iterative retrieval loop

continuation decision (ShouldContinueRetrieval)
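The loop and its continuation decision can be sketched as follows. Everything here is a toy stand-in: the corpus, `toy_retrieve`, `toy_make_query`, and `toy_should_continue` are hypothetical helpers, whereas in the described system the query generation and the stop decision would come from the fine-tuned model.

```python
def run_retrieval_loop(question, retrieve, make_query, should_continue,
                       top_k=3, max_hops=3):
    """Collect de-duplicated evidence over at most max_hops retrieval rounds."""
    seen_ids, evidence = set(), []
    hops_used = 0
    for _ in range(max_hops):
        hops_used += 1
        query = make_query(question, evidence)      # dynamic query generation
        for doc in retrieve(query, top_k):
            if doc["id"] not in seen_ids:           # document de-duplication
                seen_ids.add(doc["id"])
                evidence.append(doc)
        if not should_continue(question, evidence):  # continuation decision
            break
    return evidence, hops_used

# Toy stand-ins so the sketch runs end to end.
CORPUS = [
    {"id": "d1", "text": "Inception was directed by Christopher Nolan."},
    {"id": "d2", "text": "Christopher Nolan was born in London."},
]

def toy_retrieve(query, top_k):
    terms = query.lower().split()
    hits = [d for d in CORPUS if any(t in d["text"].lower() for t in terms)]
    return hits[:top_k]

def toy_make_query(question, evidence):
    # First round: search the question; later rounds: follow the last document.
    return question if not evidence else evidence[-1]["text"]

def toy_should_continue(question, evidence):
    # Stop once a document mentioning a birthplace has been found.
    return not any("born" in d["text"] for d in evidence)

evidence, hops = run_retrieval_loop(
    "Inception director", toy_retrieve, toy_make_query, toy_should_continue
)
```

Capping the loop with `max_hops` and skipping already-seen documents addresses the noise and repeated-query problems that plain iterative retrieval runs into.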

Is Agentic

Yes

Architectures

LLM (Qwen2.5 family)

two-agent pipeline: Planning Agent + Execution Agent

Optimization Features

Token Efficiency

SFT

Infra Optimization

DeepSpeed

GRPO

SFT

Model Optimization

SFT

GRPO

System Optimization

DeepSpeed ZeRO stage 3 for memory scaling

sequence parallelism

Training Optimization

bfloat16 precision

gradient checkpointing

KL penalty and entropy regularization for RL

Inference Optimization

Flash Attention to speed attention computation

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluation restricted to Wikipedia-style multi-hop QA benchmarks; web search and broader corpora not evaluated.

SFT data is synthetic (generated and filtered), so the fine-tuned model risks inheriting the biases of the generator model.

When Not To Use

For single-hop or simple factual queries where standard RAG is sufficient.

When compute or fine-tuning budget is very small.

Failure Modes

Over-decomposition: planner splits tasks into unnecessary steps, adding error and latency.

Noise from too many retrieved documents can degrade accuracy.

Core Entities

Models

Qwen2.5-7B-Instruct

Qwen2.5-3B-Instruct

Qwen2.5-32B-Instruct

DeepSeek-v3

Metrics

Exact Match (EM)

Datasets

HotpotQA

2WikiMultiHopQA

MuSiQue

Bamboogle

SFT

MuSiQue planning set (10,473 examples)

Benchmarks

HotpotQA

2WikiMultiHopQA

MuSiQue

Bamboogle
