Pre-train LLMs to use search tools: mask-and-search task (RAMP) improves multi-step retrieval and reasoning

May 26, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper reports consistent empirical gains across model families and datasets, with RL and curriculum learning adding further benefit. However, evaluation relies on token-level recall, which can be gamed, and the limited diversity of self-distilled data may cap the achievable gains.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou

Links

Abstract / PDF / Code

Why It Matters For Business

Pre-training models to search and reason (RAMP) improves multi-step QA accuracy and generalization, reducing downstream data needs and making search-enabled agents more reliable.

Who Should Care

Summary TLDR

MASKSEARCH introduces RAMP, a pre-training task in which a model must search and reason to fill masked spans in text. The authors generate a large Chain-of-Thought (CoT) dataset (agent-based synthesis plus self-distillation), train with Supervised Fine-Tuning (SFT) and Reinforcement Learning (DAPO), and apply curriculum learning by mask count. Across multiple multi-hop QA benchmarks, RAMP pre-training consistently raises token-level recall versus strong baselines. RL with model-based rewards and curriculum learning give the largest gains. Code is available.

Problem Statement

LLM-based search agents work better when they can plan, call search tools, and reason across multiple steps, but current methods are trained on narrow, task-specific data. We need a scalable pre-training task that teaches general, transferable search-and-reason behavior so agents generalize to new open-domain QA tasks.

Main Contribution

Define Retrieval-Augmented Mask Prediction (RAMP): a scalable pre-training task where models search external corpora to fill masked spans.

A data pipeline combining multi-agent synthesis (planner, rewriter, observer) and iterative self-distillation to build large CoT datasets (10M trajectories).
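To make the RAMP task concrete, here is a minimal single-step sketch of "retrieve, then fill the mask". It is a toy under stated assumptions, not the authors' pipeline: `CORPUS`, `search`, and `fill_mask` are hypothetical names, the retriever is plain word overlap over an in-memory list, and the answer is read off by aligning the text around the mask. The planner, rewriter, and observer roles named above are not modeled here.

```python
import re

# Hypothetical in-memory corpus standing in for an external search index.
CORPUS = [
    "Alan Turing was born in London in 1912.",
    "Alan Turing studied at King's College, Cambridge.",
]

def _words(text):
    """Lowercased word set used for overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def search(query, corpus=CORPUS):
    """Toy retriever: return the passage sharing the most words with the query."""
    q = _words(query)
    return max(corpus, key=lambda passage: len(q & _words(passage)))

def fill_mask(masked_text):
    """Fill a single [mask] by retrieving a passage and aligning the text around the mask."""
    query = masked_text.replace("[mask]", " ")
    passage = search(query)
    left, right = masked_text.split("[mask]")
    pattern = re.escape(left) + r"(.+?)" + re.escape(right) + r"$"
    match = re.match(pattern, passage)
    return {
        "query": query,
        "observation": passage,
        "answer": match.group(1) if match else None,
    }

print(fill_mask("Alan Turing was born in London in [mask]."))
```

In a real setting the retrieved passage would rarely match the masked sentence verbatim, so the alignment step would be replaced by model reasoning over the observation, possibly across several search calls.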

Key Findings

RAMP pre-training raises average token-level recall on evaluated multi-hop QA benchmarks.

Numbers: Qwen2.5-7B avg recall: 71.01 vs Distilled 67.29 (+3.72)

Practical Use: Add a RAMP pre-training stage before downstream fine-tuning to get a ~3–4 point average recall boost on multi-hop QA benchmarks.

Evidence Ref: Table 2

RL training on RAMP yields larger in-domain improvements than SFT.

Numbers: HotpotQA: RL->RL 75.61 vs SFT->SFT 70.44 (+5.17)

Practical Use: If you can afford RL (DAPO) and a model-based judge, use RL on RAMP to push higher in-domain performance.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HotpotQA token-level recall (Qwen2.5-7B) | 75.61 | Distilled Search-R1 69.55 | +6.06 | HotpotQA | Table 2 shows MASKSEARCH RL->RL 75.61 vs Distilled Search-R1 69.55 | Table 2
Average recall across test sets (Qwen2.5-7B) | 71.01 | Distilled Search-R1 67.29 | +3.72 | Avg across listed datasets | Table 2 average results | Table 2

What To Try In 7 Days

Create a small RAMP dataset from Wikipedia using salient-span masking and agent prompts.

Fine-tune an existing instruct model on RAMP for a few epochs, then evaluate on one multi-hop QA set.

If resources allow, run DAPO RL with a model-based judge to evaluate additional gains.
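The first step above could start from something like the following sketch, which turns a sentence into a masked input plus its target answers. The salient-span detector here is a regex heuristic (four-digit years and runs of two or more capitalized words) chosen only for illustration. It is an assumption, not the paper's method, and `SALIENT`, `make_ramp_example`, and the `[mask_i]` placeholder format are hypothetical names.

```python
import re

# Heuristic stand-in for salient-span detection (an assumption):
# four-digit years, or runs of two or more capitalized words.
SALIENT = re.compile(
    r"\b(?:1[0-9]{3}|20[0-9]{2})\b|\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b"
)

def make_ramp_example(text, max_masks=2):
    """Replace up to max_masks salient spans with [mask_i] and keep the targets."""
    answers = []

    def _substitute(match):
        # Leave the span untouched once the mask budget is used up.
        if len(answers) >= max_masks:
            return match.group(0)
        answers.append(match.group(0))
        return f"[mask_{len(answers)}]"

    masked_text = SALIENT.sub(_substitute, text)
    return {"masked_text": masked_text, "answers": answers}

example = make_ramp_example("Alan Turing was born in London in 1912.")
print(example["masked_text"])  # [mask_1] was born in London in [mask_2].
print(example["answers"])      # ['Alan Turing', '1912']
```

A named-entity tagger would likely find better spans than this regex (it misses single-word entities such as "London"), but the output shape, masked text plus an ordered answer list, is what downstream training and evaluation would consume.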

Agent Features

Memory
masks retrieved tokens as latent variable during training
Planning
planner-rewriter-observer startup, multi-step search plan generation
Tool Use
search engine calls (web/Wikipedia)
Frameworks
DAPO (RL), LLM-as-Judge, self-evolve distillation
Is Agentic

Yes

Architectures
single-model LLM agentic RAG
Collaboration
multi-agent synthesis for data (planner, rewriter, observer)

Optimization Features

Training Optimization
self-evolve distillation to scale dataset, curriculum learning by mask count
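Curriculum learning by mask count can be pictured as grouping training examples by how many masks they contain and presenting the groups in order. The sketch below assumes an easy-to-hard schedule (fewer masks first) and an example format with an `answers` list, one entry per mask. Both are illustrative assumptions, and `curriculum_stages` is a hypothetical helper.

```python
from collections import defaultdict

def curriculum_stages(examples, max_masks):
    """Group examples by mask count and yield stages from 1 mask up to max_masks."""
    by_count = defaultdict(list)
    for example in examples:
        # One answer per mask, so the answer count is the mask count.
        by_count[len(example["answers"])].append(example)
    for k in range(1, max_masks + 1):
        if k in by_count:
            yield k, by_count[k]

examples = [
    {"id": "a", "answers": ["x", "y"]},
    {"id": "b", "answers": ["x"]},
    {"id": "c", "answers": ["x", "y", "z"]},
    {"id": "d", "answers": ["q"]},
]

for mask_count, stage in curriculum_stages(examples, max_masks=3):
    print(mask_count, [example["id"] for example in stage])
```

Each yielded stage would be trained on before moving to the next; whether earlier stages are replayed alongside later ones is a design choice this sketch leaves open.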

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only a search tool is used; not evaluated with other tool types.

Token-level recall metric can be gamed; authors note reward-hacking.
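The gaming concern is easy to see with a small worked example. The sketch below uses a common definition of token-level recall (the fraction of gold-answer tokens that appear in the prediction). The paper's exact tokenization and normalization may differ, so treat `tokenize` and `token_recall` as illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())

def token_recall(prediction, gold):
    """Fraction of gold tokens that also appear in the prediction."""
    gold_counts = Counter(tokenize(gold))
    pred_counts = Counter(tokenize(prediction))
    if not gold_counts:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((gold_counts & pred_counts).values())
    return overlap / sum(gold_counts.values())

gold = "Marie Curie"
print(token_recall("Marie Curie", gold))      # 1.0
print(token_recall("Albert Einstein", gold))  # 0.0
# Listing many candidates also reaches full recall:
print(token_recall("Albert Einstein, Marie Curie, Niels Bohr, or Max Planck", gold))  # 1.0
```

Because recall ignores prediction length, the enumerated answer scores the same as the concise correct one, which is why a length-insensitive reward can encourage long or list-style outputs.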

When Not To Use

When no reliable external search is available.

For tasks that require strict factual verification beyond token overlap.

Failure Modes

Reward hacking in RL leads to long or enumerated outputs that inflate recall.

Overfitting to self-generated trajectories reduces out-of-domain generality.

Core Entities

Models

QWEN2.5-1.5B, QWEN2.5-3B, QWEN2.5-7B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.1-8B, QWEN-MAX, Qwen2.5-72B-Instruct

Metrics

token-level Recall

Datasets

Wikipedia, HotpotQA, FanoutQA, MuSiQue, 2WikiMultiHopQA, Bamboogle, FreshQA

Benchmarks

open-domain multi-hop QA