Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Pre-training models on a search-and-reason task (RAMP) improves multi-step QA recall and generalization, which can reduce downstream data needs and make search-enabled agents more reliable.
Summary TLDR
MASKSEARCH introduces RAMP, a pre-training task in which a model must search and reason to fill masked spans in text. The authors generate a large Chain-of-Thought (CoT) dataset (agent-based synthesis plus self-distillation), train with Supervised Fine-Tuning (SFT) and Reinforcement Learning (DAPO), and apply curriculum learning by mask count. Across multiple multi-hop QA benchmarks, RAMP pre-training consistently raises token-level recall versus strong baselines. RL with model-based rewards and curriculum learning give the largest gains. Code is available.
Problem Statement
LLM-based search agents work better when they can plan, call search tools, and reason across multiple steps, but current methods are trained on narrow, task-specific data. We need a scalable pre-training task that teaches general, transferable search-and-reason behavior so agents generalize to new open-domain QA tasks.
Main Contribution
Define Retrieval-Augmented Mask Prediction (RAMP): a scalable pre-training task where models search external corpora to fill masked spans.
Build a data pipeline that combines multi-agent synthesis (planner, rewriter, observer) with iterative self-distillation to produce large CoT datasets (10M trajectories).
Demonstrate SFT and RL (DAPO) training on RAMP, plus curriculum learning by number of masks, improving multi-hop QA recall across models and datasets.
Key Findings
RAMP pre-training raises average token-level recall on evaluated multi-hop QA benchmarks.
RL training on RAMP yields larger in-domain improvements than SFT.
Curriculum learning by mask count improves generalization versus mixed training for some models.
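Curriculum learning by mask count can be pictured as ordering training examples from few masks to many. The sketch below is a minimal illustration under that assumption only: the `RampExample` type, the `curriculum_stages` helper, and the one-stage-per-mask-count layout are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class RampExample:
    """Hypothetical container for one masked training example."""
    text: str
    num_masks: int


def curriculum_stages(examples):
    """Group examples into stages of increasing mask count (easy to hard)."""
    ordered = sorted(examples, key=lambda ex: ex.num_masks)
    # groupby needs sorted input; each group becomes one training stage.
    return [list(group) for _, group in groupby(ordered, key=lambda ex: ex.num_masks)]


examples = [
    RampExample("A [mask] B [mask] C [mask]", 3),
    RampExample("A [mask]", 1),
    RampExample("A [mask] B [mask]", 2),
    RampExample("D [mask]", 1),
]
stages = curriculum_stages(examples)
stage_mask_counts = [stage[0].num_masks for stage in stages]
```

A mixed-training baseline would instead shuffle all examples into a single stage; comparing the two schedules is what the finding above refers to.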
Results
[Charts: HotpotQA token-level recall (Qwen2.5-7B); average recall across test sets (Qwen2.5-7B); average recall across test sets (LLaMA-3.2-1B)]
What To Try In 7 Days
Create a small RAMP dataset from Wikipedia using salient-span masking and agent prompts.
Fine-tune an existing instruct model on RAMP for a few epochs, then evaluate on one multi-hop QA set.
If resources allow, run DAPO RL with a model-based judge to evaluate additional gains.
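As a starting point for the first step, here is a minimal sketch of turning one passage into a masked example. It assumes the salient spans have already been chosen by some upstream step (for example an entity tagger) and that they do not overlap; the `make_ramp_example` function, the `[mask]` token string, and the output keys are illustrative assumptions, not the authors' format.

```python
import random


def make_ramp_example(text, salient_spans, max_masks=2, seed=0):
    """Replace up to `max_masks` salient spans with [mask] and keep the answers.

    Assumes `salient_spans` are non-overlapping strings supplied by the caller;
    this sketch does not detect spans itself.
    """
    rng = random.Random(seed)
    present = [span for span in salient_spans if span in text]
    chosen = rng.sample(present, k=min(max_masks, len(present)))
    # Order by first appearance so answers line up with the masks left to right.
    chosen.sort(key=text.index)
    masked = text
    for span in chosen:
        masked = masked.replace(span, "[mask]", 1)
    return {"masked_text": masked, "answers": chosen, "num_masks": len(chosen)}


passage = "Marie Curie won the Nobel Prize in Physics in 1903."
spans = ["Marie Curie", "Nobel Prize in Physics", "1903"]
full = make_ramp_example(passage, spans, max_masks=3)
partial = make_ramp_example(passage, spans, max_masks=2)
```

The `num_masks` field is kept so that examples can later be grouped by difficulty; the search-and-reason trajectories that fill the masks would be produced separately by the agent prompts.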
Agent Features
Memory
- masks retrieved tokens as a latent variable during training
Planning
- planner-rewriter-observer setup
- multi-step search plan generation
Tool Use
- search engine calls (web/Wikipedia)
Frameworks
- DAPO (RL)
- LLM-as-Judge
- self-evolve distillation
Is Agentic
true
Architectures
- single-model LLM agentic RAG
Collaboration
- multi-agent synthesis for data (planner, rewriter, observer)
Optimization Features
Training Optimization
- self-evolve distillation to scale dataset
- curriculum learning by mask count
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only a search tool is used; not evaluated with other tool types.
- The token-level recall metric can be gamed; the authors note reward hacking.
- Self-distilled training data risks reduced diversity and upper-limit bias.
When Not To Use
- When no reliable external search is available.
- For tasks that require strict factual verification beyond token overlap.
- When training resources cannot support RL or large-scale pretraining.
Failure Modes
- Reward hacking in RL leads to long or enumerated outputs that inflate recall.
- Overfitting to self-generated trajectories reduces out-of-domain generality.
- Performance depends on quality/diversity of synthesized CoT data.
Core Entities
Models
- Qwen2.5-1.5B
- Qwen2.5-3B
- Qwen2.5-7B
- LLaMA-3.2-1B
- LLaMA-3.2-3B
- LLaMA-3.1-8B
- Qwen-Max
- Qwen2.5-72B-Instruct
Metrics
- token-level recall
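For readers unfamiliar with the metric, the sketch below shows one common way to compute token-level recall: the fraction of reference-answer tokens that also appear in the prediction. The tokenization (lowercasing, splitting on non-alphanumeric characters) is an assumption made here for illustration; the paper's exact normalization may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_recall(prediction, reference):
    """Fraction of reference tokens that also appear in the prediction."""
    ref_counts = Counter(tokenize(reference))
    if not ref_counts:
        return 0.0
    pred_counts = Counter(tokenize(prediction))
    # Count each reference token at most as often as it occurs in the prediction.
    overlap = sum(min(count, pred_counts[tok]) for tok, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


reference = "Marie Curie"
concise = token_recall("The answer is Marie Curie.", reference)
enumerated = token_recall(
    "It could be Pierre Curie, Marie Curie, Albert Einstein, or Niels Bohr.", reference
)
half = token_recall("Curie", reference)
wrong = token_recall("Albert Einstein", reference)
```

Because recall ignores extra tokens, the enumerated answer scores as high as the concise one. This is the property that makes the metric gameable by long or list-style outputs.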
Datasets
- Wikipedia
- HotpotQA
- FanoutQA
- MuSiQue
- 2WikiMultiHopQA
- Bamboogle
- FreshQA
Benchmarks
- open-domain multi-hop QA

