Pre-train LLMs to use search tools: mask-and-search task (RAMP) improves multi-step retrieval and reasoning

May 26, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou

Why It Matters For Business

Pre-training models to search and reason (RAMP) improves multi-step QA accuracy and generalization, reducing downstream data needs and making search-enabled agents more reliable.

Summary TLDR

MASKSEARCH introduces RAMP, a pre-training task in which a model must search and reason to fill masked spans in text. The authors generate large-scale Chain-of-Thought (CoT) data through agent-based synthesis and self-distillation, train with Supervised Fine-Tuning (SFT) and Reinforcement Learning (DAPO), and apply curriculum learning ordered by mask count. Across multiple multi-hop QA benchmarks, RAMP pre-training consistently raises token-level recall over strong baselines. RL with model-based rewards and curriculum learning give the largest gains. Code is available.

Problem Statement

LLM-based search agents work better when they can plan, call search tools, and reason across multiple steps, but current methods are trained on narrow, task-specific data. We need a scalable pre-training task that teaches general, transferable search-and-reason behavior so agents generalize to new open-domain QA tasks.

Main Contribution

Define Retrieval-Augmented Mask Prediction (RAMP): a scalable pre-training task where models search external corpora to fill masked spans.

Build a data pipeline that combines multi-agent synthesis (planner, rewriter, observer) with iterative self-distillation to produce large CoT datasets (10M trajectories).

Demonstrate SFT and RL (DAPO) training on RAMP, plus curriculum learning by number of masks, improving multi-hop QA recall across models and datasets.
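The masking task and the mask-count curriculum can be illustrated with a toy sketch. It assumes the salient spans are already known and do not overlap; the `RampExample` class, the `[MASK]` placeholder, and both function names are hypothetical and are not taken from the authors' code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RampExample:
    masked_text: str      # text with each chosen span replaced by the placeholder
    answers: list[str]    # hidden spans, in order of appearance
    num_masks: int


def make_ramp_example(text: str, spans: list[str], placeholder: str = "[MASK]") -> RampExample:
    """Mask the first occurrence of each span (spans assumed non-overlapping)."""
    found = sorted((text.find(s), s) for s in spans if s in text)
    masked = text
    # Replace right to left so earlier character offsets stay valid.
    for idx, span in reversed(found):
        masked = masked[:idx] + placeholder + masked[idx + len(span):]
    answers = [span for _, span in found]
    return RampExample(masked, answers, len(answers))


def curriculum_stages(examples: list[RampExample]) -> list[list[RampExample]]:
    """Group examples by mask count, fewest masks (assumed easiest) first."""
    by_count: dict[int, list[RampExample]] = {}
    for ex in examples:
        by_count.setdefault(ex.num_masks, []).append(ex)
    return [by_count[k] for k in sorted(by_count)]


text = "Marie Curie won the Nobel Prize in Physics in 1903."
easy = make_ramp_example(text, ["1903"])
hard = make_ramp_example(text, ["Marie Curie", "Physics", "1903"])
stages = curriculum_stages([hard, easy])

for stage in stages:
    for ex in stage:
        print(ex.num_masks, ex.masked_text)
```

In a real pipeline the model would have to recover `answers` by issuing search queries; here the example only shows how inputs are built and ordered for staged training.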

Key Findings

RAMP pre-training raises average token-level recall on evaluated multi-hop QA benchmarks.

Numbers: Qwen2.5-7B avg recall: 71.01 vs Distilled 67.29 (+3.72)

RL training on RAMP yields larger in-domain improvements than SFT.

Numbers: HotpotQA: RL->RL 75.61 vs SFT->SFT 70.44 (+5.17)

Curriculum learning by mask count improves generalization versus mixed training for some models.

Numbers: LLaMA-3.2-1B avg recall: CL 55.93 vs Mix 53.67 (+2.26)
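The findings above are reported as token-level recall. A minimal sketch of such a metric follows, assuming lowercase word tokenization with punctuation dropped; the paper's exact normalization may differ, and the function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens, discarding punctuation."""
    return re.findall(r"\w+", text.lower())


def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction."""
    ref = Counter(tokenize(reference))
    if not ref:
        return 0.0
    pred = Counter(tokenize(prediction))
    overlap = sum((ref & pred).values())  # multiset intersection
    return overlap / sum(ref.values())


concise = token_recall("Paris", "Paris, France")
enumerated = token_recall("It could be Paris, Lyon, or Marseille in France", "Paris, France")
print(concise, enumerated)
```

Because recall ignores extra tokens, the enumerated answer scores at least as high as the concise one, which is the weakness behind the reward-hacking behavior noted under Limitations.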

Results

HotpotQA token-level recall (Qwen2.5-7B)

Value: 75.61

Baseline: Distilled Search-R1, 69.55

Average recall across test sets (Qwen2.5-7B)

Value: 71.01

Baseline: Distilled Search-R1, 67.29

Average recall across test sets (LLaMA-3.2-1B)

Value: 57.40

Baseline: Distilled Search-R1, 47.14

What To Try In 7 Days

Create a small RAMP dataset from Wikipedia using salient-span masking and agent prompts.

Fine-tune an existing instruct model on RAMP for a few epochs, then evaluate on one multi-hop QA set.

If resources allow, run DAPO RL with a model-based judge to evaluate additional gains.
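For the RL step, the sketch below shows two ingredients commonly associated with DAPO-style training: group-normalized advantages and dropping prompts whose sampled answers all received the same reward, since those carry no learning signal. It is a simplification under stated assumptions, not the authors' training code: rewards are hard-coded where a model-based judge would supply them, the function names are hypothetical, and the policy update itself (clipping, token-level loss) is omitted.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled answers."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_informative_groups(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """Drop prompts where every sample got the same reward (all right or all wrong)."""
    return {prompt: r for prompt, r in groups.items() if max(r) != min(r)}


# Hypothetical judge scores: 1.0 = judged correct, 0.0 = judged incorrect.
judge_scores = {
    "q1": [1.0, 1.0, 1.0, 1.0],   # too easy, filtered out
    "q2": [1.0, 0.0, 0.0, 1.0],   # mixed outcomes, kept
    "q3": [0.0, 0.0, 0.0, 0.0],   # too hard, filtered out
}

kept = keep_informative_groups(judge_scores)
advantages = {prompt: group_advantages(r) for prompt, r in kept.items()}
print(advantages)
```

Only `q2` survives the filter; its correct samples get positive advantages and its incorrect samples get negative ones.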

Agent Features

Memory

  • masks retrieved tokens as latent variable during training

Planning

  • planner-rewriter-observer startup
  • multi-step search plan generation

Tool Use

  • search engine calls (web/Wikipedia)
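A multi-step search loop can be sketched with a toy in-memory corpus. Everything here is an assumption made for illustration: the `CORPUS` list replaces a real web or Wikipedia search backend, `search` ranks by simple word overlap, and the regex-based `extract` stands in for the model's reasoning over each retrieved passage.

```python
import re

# Toy corpus standing in for a real search index (hypothetical data).
CORPUS = [
    "Pride and Prejudice is a novel written by Jane Austen.",
    "Jane Austen was born in Steventon, Hampshire.",
    "Hampshire is a county in southern England.",
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def search(query: str) -> str:
    """Return the passage sharing the most words with the query."""
    q = tokens(query)
    return max(CORPUS, key=lambda doc: len(q & tokens(doc)))


def extract(pattern: str, passage: str) -> str:
    """Stand-in for a reasoning step: pull one span out of a passage."""
    match = re.search(pattern, passage)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in passage")
    return match.group(1)


def birthplace_of_author(title: str) -> tuple[str, list[tuple[str, str]]]:
    """Two-hop question: find the author of a work, then where they were born."""
    trajectory = []

    query1 = f"{title} written by"
    passage1 = search(query1)
    trajectory.append((query1, passage1))
    author = extract(r"written by ([^.]+)\.", passage1)

    # The second query depends on what the first search returned.
    query2 = f"{author} born in"
    passage2 = search(query2)
    trajectory.append((query2, passage2))
    place = extract(r"born in ([^.]+)\.", passage2)

    return place, trajectory


answer, trajectory = birthplace_of_author("Pride and Prejudice")
print(answer)
for query, passage in trajectory:
    print(f"search({query!r}) -> {passage}")
```

The recorded `trajectory` of query and observation pairs is the kind of trace that a search agent's training data would need to capture, although the real format is not specified here.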

Frameworks

  • DAPO (RL)
  • LLM-as-Judge
  • self-evolve distillation

Is Agentic

true

Architectures

  • single-model LLM agentic RAG

Collaboration

  • multi-agent synthesis for data (planner, rewriter, observer)

Optimization Features

Training Optimization

  • self-evolve distillation to scale dataset
  • curriculum learning by mask count

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only a search tool is used; not evaluated with other tool types.
  • Token-level recall metric can be gamed; authors note reward-hacking.
  • Self-distilled training data risks reduced diversity and upper-limit bias.

When Not To Use

  • When no reliable external search is available.
  • For tasks that require strict factual verification beyond token overlap.
  • When training resources cannot support RL or large-scale pretraining.

Failure Modes

  • Reward hacking in RL leads to long or enumerated outputs that inflate recall.
  • Overfitting to self-generated trajectories reduces out-of-domain generality.
  • Performance depends on quality/diversity of synthesized CoT data.

Core Entities

Models

  • Qwen2.5-1.5B
  • Qwen2.5-3B
  • Qwen2.5-7B
  • LLaMA-3.2-1B
  • LLaMA-3.2-3B
  • LLaMA-3.1-8B
  • Qwen-Max
  • Qwen2.5-72B-Instruct

Metrics

  • token-level recall

Datasets

  • Wikipedia
  • HotpotQA
  • FanoutQA
  • MuSiQue
  • 2WikiMultiHopQA
  • Bamboogle
  • FreshQA

Benchmarks

  • open-domain multi-hop QA