Pre-train LLMs to use search tools: mask-and-search task (RAMP) improves multi-step retrieval and reasoning

Overview

Decision Snapshot: Needs Validation

The paper reports consistent empirical gains across model families and datasets, with RL and curriculum learning adding further benefit. Two caveats apply: evaluation relies on token-level recall, which can be gamed, and limited data diversity from self-distillation may cap the achievable gains.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou

Why It Matters For Business

Pre-training models to search and reason (RAMP) improves multi-step QA accuracy and generalization, reducing downstream data needs and making search-enabled agents more reliable.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

MASKSEARCH introduces RAMP, a pre-training task in which a model must search and reason to fill masked spans in text. The authors generate large-scale Chain-of-Thought (CoT) data (agent-based synthesis plus self-distillation), train with Supervised Fine-Tuning (SFT) and Reinforcement Learning (DAPO), and apply curriculum learning ordered by mask count. Across multiple multi-hop QA benchmarks, RAMP pre-training consistently raises token-level recall versus strong baselines. RL with model-based rewards and curriculum learning give the largest gains. Code is available.

Problem Statement

LLM-based search agents work better when they can plan, call search tools, and reason across multiple steps, but current methods are trained on narrow, task-specific data. We need a scalable pre-training task that teaches general, transferable search-and-reason behavior so agents generalize to new open-domain QA tasks.

Main Contribution

Define Retrieval-Augmented Mask Prediction (RAMP): a scalable pre-training task where models search external corpora to fill masked spans.

A data pipeline combining multi-agent synthesis (planner, rewriter, observer) and iterative self-distillation to build large CoT datasets (10M trajectories).
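The planner, rewriter, and observer roles can be pictured with a toy loop. This is only a sketch under assumptions: the `[mask]` placeholder, the in-memory `CORPUS`, the function names, the division of labor between roles, and the keep-only-correct filter are all hypothetical stand-ins, not the authors' pipeline, which uses LLM agents and a real search tool.

```python
from typing import Dict, List

MASK = "[mask]"  # assumed placeholder token, not confirmed by the source

# Toy in-memory corpus standing in for a real search tool (assumption).
CORPUS: Dict[str, str] = {
    "alan turing birth year": "1912",
    "alan turing birthplace": "London",
}


def search(query: str) -> str:
    """Hypothetical search tool: exact-match lookup in the toy corpus."""
    return CORPUS.get(query.lower(), "")


def planner(masked_text: str, subgoals: List[str]) -> List[str]:
    """Hypothetical planner: keep one sub-goal per masked span, in order."""
    return subgoals[: masked_text.count(MASK)]


def rewriter(entity: str, subgoal: str) -> str:
    """Hypothetical rewriter: turn a sub-goal into a search query."""
    return f"{entity} {subgoal}"


def observer(result: str) -> str:
    """Hypothetical observer: extract the answer from a search result."""
    return result.strip()


def synthesize(masked_text: str, entity: str,
               subgoals: List[str], gold: List[str]) -> dict:
    """Run the three roles in sequence and record the trajectory."""
    steps = []
    filled = masked_text
    for subgoal in planner(masked_text, subgoals):
        query = rewriter(entity, subgoal)
        answer = observer(search(query))
        steps.append({"query": query, "answer": answer})
        filled = filled.replace(MASK, answer, 1)  # fill masks left to right
    answers = [step["answer"] for step in steps]
    # Assumed filter: keep a trajectory only if every mask was filled correctly.
    return {"steps": steps, "filled": filled, "keep": answers == gold}


trajectory = synthesize(
    "Alan Turing was born in [mask] in [mask].",
    "Alan Turing",
    ["birth year", "birthplace"],
    ["1912", "London"],
)
print(trajectory["filled"])
```

In a real setup each role would be an LLM call and `search` would hit a web or Wikipedia index; the loop structure is the only part this sketch tries to convey.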

Key Findings

RAMP pre-training raises average token-level recall on evaluated multi-hop QA benchmarks.

Numbers: Qwen2.5-7B avg recall: 71.01 vs Distilled 67.29 (+3.72)

Practical Use: Add a RAMP pre-training stage before downstream fine-tuning to get a ~3–4 point average recall boost on multi-hop QA benchmarks.

Evidence Ref: Table 2

RL training on RAMP yields larger in-domain improvements than SFT.

Numbers: HotpotQA: RL->RL 75.61 vs SFT->SFT 70.44 (+5.17)

Practical Use: If you can afford RL (DAPO) and a model-based judge, use RL on RAMP to push higher in-domain performance.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HotpotQA token-level recall (Qwen2.5-7B)	75.61	Distilled Search-R1 69.55	+6.06	HotpotQA	Table 2 shows MASKSEARCH RL->RL 75.61 vs Distilled Search-R1 69.55	Table 2
Average recall across test sets (Qwen2.5-7B)	71.01	Distilled Search-R1 67.29	+3.72	Avg across listed datasets	Table 2 average results	Table 2
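The deltas quoted on this page are plain differences in recall points between the reported value and its baseline. A small check, using only the numbers listed above (the `rows` layout and the `delta` helper are illustrative names, not from the paper):

```python
# (label, value, baseline, reported delta), all copied from this page
rows = [
    ("HotpotQA RL->RL vs Distilled Search-R1", 75.61, 69.55, 6.06),
    ("Average vs Distilled Search-R1", 71.01, 67.29, 3.72),
    ("HotpotQA RL->RL vs SFT->SFT", 75.61, 70.44, 5.17),
]


def delta(value: float, baseline: float) -> float:
    """Absolute gain in recall points, rounded to two decimals."""
    return round(value - baseline, 2)


for label, value, baseline, reported in rows:
    status = "ok" if delta(value, baseline) == reported else "mismatch"
    print(f"{label}: {delta(value, baseline):+.2f} ({status})")
```

These are absolute point gains, not relative improvements, so they should not be read as percentage increases.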

What To Try In 7 Days

Create a small RAMP dataset from Wikipedia using salient-span masking and agent prompts.

Fine-tune an existing instruct model on RAMP for a few epochs, then evaluate on one multi-hop QA set.

If resources allow, run DAPO RL with a model-based judge to evaluate additional gains.
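For the first step, a masked example can be produced with a few lines of code. The sketch below is an assumption-heavy stand-in: it uses a crude regex (four-digit years and runs of capitalized words) in place of proper salient-span detection, and the `[mask]` placeholder and `make_ramp_example` name are hypothetical. A real pipeline would more likely use an NER model to pick spans.

```python
import re
from typing import List, Tuple

MASK = "[mask]"  # assumed placeholder token

# Crude heuristic for "salient" spans: 4-digit years, or two or more
# consecutive capitalized words. This is a stand-in, not the paper's method.
SPAN_PATTERN = re.compile(r"\b\d{4}\b|\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)+\b")


def make_ramp_example(passage: str, max_masks: int = 2) -> Tuple[str, List[str]]:
    """Mask up to `max_masks` spans, left to right; return (masked text, answers)."""
    answers: List[str] = []

    def _replace(match: re.Match) -> str:
        if len(answers) >= max_masks:
            return match.group(0)  # mask budget used up: leave the span as-is
        answers.append(match.group(0))
        return MASK

    masked = SPAN_PATTERN.sub(_replace, passage)
    return masked, answers


masked, answers = make_ramp_example("Alan Turing was born in London in 1912.")
print(masked)
print(answers)
```

The `max_masks` parameter controls how many spans the model must recover per passage, which is the quantity a mask-count curriculum would vary.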

Agent Features

Memory

masks retrieved tokens as latent variable during training

Planning

planner-rewriter-observer startup, multi-step search plan generation

Tool Use

search engine calls (web/Wikipedia)

Frameworks

DAPO (RL), LLM-as-Judge, self-evolve distillation

Is Agentic

Yes

Architectures

single-model LLM, agentic RAG

Collaboration

multi-agent synthesis for data (planner, rewriter, observer)

Optimization Features

Training Optimization

self-evolve distillation to scale dataset, curriculum learning by mask count
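Curriculum learning by mask count amounts to presenting examples with fewer masked spans before examples with more. A minimal sketch, assuming a `[mask]` placeholder and non-cumulative stages; the `curriculum_stages` name and the staging policy are hypothetical, and the paper's actual schedule may differ:

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

MASK = "[mask]"  # assumed placeholder token


def curriculum_stages(examples: List[str]) -> Iterator[Tuple[int, List[str]]]:
    """Group examples by mask count and yield the groups from easiest to hardest."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for text in examples:
        buckets[text.count(MASK)].append(text)
    for mask_count in sorted(buckets):
        yield mask_count, buckets[mask_count]


examples = [
    "[mask] was born in [mask].",
    "The capital of France is [mask].",
    "[mask] founded [mask] in [mask].",
    "[mask] wrote Hamlet.",
]
for mask_count, batch in curriculum_stages(examples):
    print(mask_count, len(batch))
```

Each yielded stage would feed one phase of training, so the model sees single-mask examples before multi-mask ones.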

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Alibaba-NLP/MaskSearch

Risks & Boundaries

Limitations

Only a search tool is used; not evaluated with other tool types.

Token-level recall metric can be gamed; authors note reward-hacking.
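To see why token-level recall can be gamed, it helps to write the metric down. The version below is a common formulation (fraction of gold-answer tokens that appear in the prediction); the tokenizer and the `token_recall` name are assumptions, and the paper's exact normalization may differ.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def token_recall(prediction: str, gold: str) -> float:
    """Fraction of gold tokens that also occur in the prediction."""
    gold_counts = Counter(tokenize(gold))
    if not gold_counts:
        return 0.0
    pred_counts = Counter(tokenize(prediction))
    overlap = sum(min(count, pred_counts[token])
                  for token, count in gold_counts.items())
    return overlap / sum(gold_counts.values())


gold = "Paris"
print(token_recall("Lyon", gold))                             # single wrong guess
print(token_recall("Paris, Lyon, Marseille, or Nice", gold))  # enumerated guesses
```

Because the metric never penalizes extra tokens, a prediction that lists many candidates scores as well as a precise answer. That is the behavior the reward-hacking note refers to, and it is why a model-based judge is a useful complement.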

When Not To Use

When no reliable external search is available.

For tasks that require strict factual verification beyond token overlap.

Failure Modes

Reward hacking in RL leads to long or enumerated outputs that inflate recall.

Overfitting to self-generated trajectories reduces out-of-domain generality.

Core Entities

Models

Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.1-8B, Qwen-Max, Qwen2.5-72B-Instruct

Metrics

token-level Recall

Datasets

Wikipedia, HotpotQA, FanoutQA, MuSiQue, 2WikiMultiHopQA, Bamboogle, FreshQA

Benchmarks

open-domain multi-hop QA
