A 1,000+ task environment and benchmark that shows training on many verifiable tasks boosts LLM reasoning and efficiency

August 12, 2025 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen

Links

Abstract / PDF

Why It Matters For Business

Training on many verifiable tasks yields broader reasoning, faster RL training, and better generalization—useful for robust assistants, automated QA, and data-synthesis pipelines.

Summary TLDR

INTERNBOOTCAMP is an open-source library of 1000+ verifiable reasoning tasks and a 9,232-sample benchmark (BOOTCAMP-EVAL). The authors show that training LLMs on many verifiable tasks (task scaling) improves reasoning performance and RL training efficiency. They release code and data and demonstrate large gains when combining supervised fine-tuning (SFT) and reinforcement learning (RL) on synthesized long chain-of-thought data.

Problem Statement

Current RL efforts for LLM reasoning focus on narrow domains (math/code). Real-world reasoning needs cross-domain, verifiable tasks and scalable task generation. Building large, verifiable task libraries by hand is impractical, and it's unclear how increasing the number of training tasks affects reasoning generalization and RL training efficiency.

Main Contribution

INTERNBOOTCAMP: open-source library with 1000+ verifiable reasoning task classes and unified interfaces to generate problems and verify solutions.

BOOTCAMP-EVAL: a cross-domain benchmark (9,232 samples across 118 human-curated tasks) for measuring reasoning generalization.

Automatic agent workflow (evolutionary generation + self-consistent unit tests) to scale bootcamp creation and retain a core set of 704 high-quality tasks for experiments.

Empirical finding: scaling the number of verifiable training tasks (8 → 512 → full set) improves reasoning performance and RL training efficiency, and enables emergent abilities.

Practical recipe: data synthesis + SFT followed by RL on bootcamp tasks yields the best gains on in-domain and OOD reasoning benchmarks.
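The unified generate-and-verify interface in the first contribution can be pictured with a toy task class. This is a minimal sketch under stated assumptions, not INTERNBOOTCAMP's actual API: `SumBootcamp`, `Problem`, and the `generate` method are hypothetical names, and although `verify_score` is the name this summary uses for the reward signal, its signature here is assumed.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    target: int  # ground truth kept so a rule-based verifier can check answers


class SumBootcamp:
    """Toy task class (hypothetical, not taken from the INTERNBOOTCAMP code)."""

    def __init__(self, n_terms: int = 3, seed: int = 0):
        self.n_terms = n_terms
        self.rng = random.Random(seed)

    def generate(self) -> Problem:
        # Sample a fresh problem instance; difficulty is set by n_terms.
        terms = [self.rng.randint(1, 99) for _ in range(self.n_terms)]
        prompt = (
            "Compute " + " + ".join(map(str, terms))
            + ". Finish with 'ANSWER: <integer>'."
        )
        return Problem(prompt=prompt, target=sum(terms))

    def verify_score(self, response: str, problem: Problem) -> float:
        # Assumed signature: score in [0, 1] based on the last ANSWER field.
        marker = "ANSWER:"
        if marker not in response:
            return 0.0
        answer = response.rsplit(marker, 1)[1].strip()
        try:
            return 1.0 if int(answer) == problem.target else 0.0
        except ValueError:
            return 0.0
```

Because generation and verification live in one class, the same object can produce training prompts and score model responses without human labels, which is what makes a task "verifiable" in the sense used here.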

Key Findings

INTERNBOOTCAMP supports 1000+ reasoning task classes and the authors used a core set of 704 tasks for experiments.

Numbers: 1000+ tasks total; 704 tasks retained for experiments

BOOTCAMP-EVAL is a verifiable evaluation suite of 9,232 examples spanning 118 tasks.

Numbers: 9,232 samples across 118 tasks

Scaling the number of training tasks improves RL training efficiency, and validation performance rises roughly linearly between 8 and 512 tasks.

Numbers: Observed near-linear validation improvement when increasing tasks from 8 to 512 (Figures 5a and 5b)

Large performance gains from Bootcamp training: DeepSeek-R1-Distilled-Qwen-32B overall score improved 31.5 → 55.7 after Bootcamp-RL; Qwen2.5-32B-Instruct improved 24.4 → 61.1 after Bootcamp-SFT.

Numbers: DS baseline 31.5 → 55.7 (+24.2); Qwen2.5 baseline 24.4 → 61.1 (+36.7) (Table 5)

Automated generation quality improved with iterative (evolutionary) refinement: simplification keyword presence fell from 97.93% to 32.46% across 3 iterations, and problematic bootcamps decreased from 33/228 to 14/228.

Numbers: Simplify-keyword freq: 97.93% → 54.39% → 32.46%; problematic bootcamps: 33/228 → 19/228 → 14/228 (Tables 1 and 2)

Emergent learning: tasks that do not improve when trained in isolation can become solvable when trained in a diverse 512-task mix, with a critical improvement after ~300 RL steps.

Numbers: Emergent moment observed around 300 RL steps in 512-task training (Fig. 8)
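The score deltas and problematic-bootcamp counts quoted above can be checked with a few lines of arithmetic. The snippet below only recomputes figures already reported in this summary; the variable names are illustrative.

```python
# (before, after) overall BOOTCAMP-EVAL scores as reported in this summary.
reported = {
    "DeepSeek-R1-Distilled-Qwen-32B (Bootcamp-RL)": (31.5, 55.7),
    "Qwen2.5-32B-Instruct (Bootcamp-SFT)": (24.4, 61.1),
}
gains = {
    name: round(after - before, 1) for name, (before, after) in reported.items()
}

# Problematic bootcamps out of 228 across the three refinement iterations.
problematic_counts = [33, 19, 14]
total_bootcamps = 228
problematic_rates = [
    round(100 * count / total_bootcamps, 2) for count in problematic_counts
]

print(gains)              # gains of 24.2 and 36.7 points
print(problematic_rates)  # [14.47, 8.33, 6.14] percent
```

Expressed as rates, the share of problematic bootcamps falls from about 14.5% to about 6.1% over three iterations.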

Results

BOOTCAMP-EVAL size

Value: 9,232 samples across 118 tasks

Core tasks used in experiments

Value: 704 tasks retained for task-scaling experiments

DeepSeek-R1-Distilled-Qwen-32B overall BOOTCAMP-EVAL score

Value: 31.5 → 55.7 (after Bootcamp-RL)

Baseline: 31.5 (pre-RL)

Qwen2.5-32B-Instruct overall BOOTCAMP-EVAL score

Value: 24.4 → 61.1 (after Bootcamp-SFT)

Baseline: 24.4 (before Bootcamp-SFT)

Out-of-domain avg score (selected)

Value: DS-R1-Distilled-Qwen-32B: 52.5 → 56.9 (+4.4) with Bootcamp-RL; → 61.8 with SFT+RL

Baseline: 52.5 (baseline avg in Table 6)

Self-consistent unittest filtering thresholds

Value: Filter bootcamps with self-accuracy >0.85 or <0.03

Emergent improvement timing

Value: Critical emergence observed at ≈300 RL steps for 512-task mixed training

Baseline: No gains when trained in isolation
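The self-consistent unit-test thresholds listed above (0.85 and 0.03) could be applied roughly as follows. This is a sketch under one assumption: bootcamps whose self-accuracy falls outside the band are discarded, on the reading that near-perfect scores suggest a trivial task and near-zero scores suggest a broken generator or verifier. The function names are hypothetical.

```python
def self_accuracy(scores: list[float]) -> float:
    """Mean verification score of a model on a bootcamp's own generated problems."""
    if not scores:
        raise ValueError("need at least one verification score")
    return sum(scores) / len(scores)


def keep_bootcamp(accuracy: float, low: float = 0.03, high: float = 0.85) -> bool:
    # Assumed reading of the thresholds: drop when accuracy > 0.85 or < 0.03.
    return low <= accuracy <= high


candidates = {
    "too_easy": [1.0] * 19 + [0.0],      # 0.95 -> dropped
    "likely_broken": [0.0] * 50,         # 0.00 -> dropped
    "useful": [1.0] * 8 + [0.0] * 12,    # 0.40 -> kept
}
kept = [
    name for name, scores in candidates.items()
    if keep_bootcamp(self_accuracy(scores))
]
```

As the Failure Modes section notes, a filter of this kind can still pass subtle semantic errors, so it complements human spot-checking rather than replacing it.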

Who Should Care

What To Try In 7 Days

Clone INTERNBOOTCAMP and run BOOTCAMP-EVAL on your model to get a cross-domain baseline.

Generate a small synthetic dataset (few thousand long-CoT samples) from 50–200 bootcamp tasks and fine-tune (SFT) one model.

Follow SFT with a short RL pass (DAPO/GRPO) using verify_score as the reward; compare against SFT-only and RL-only baselines over 300 or fewer steps to spot efficiency gains.
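For the RL step, GRPO-style methods turn per-response rewards into group-relative advantages: each prompt gets several sampled responses, and every response's reward is normalized against its own group. The sketch below assumes verification scores in [0, 1] as rewards and uses the population standard deviation; real implementations differ on that detail, and `group_relative_advantages` is an illustrative name, not a library function.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize one prompt's rollout rewards to zero mean and unit scale."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population std; an assumption of this sketch
    return [(s - mu) / (sigma + eps) for s in scores]


# Four rollouts for one prompt, scored by the task's verifier.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same score, all advantages are zero and the
# prompt contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The uniform case is why low task diversity hurts: groups that are all correct or all wrong produce no gradient, which matches the entropy-collapse failure mode described later in this summary.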

Optimization Features

Training Optimization

  • Task scaling: multi-task RL on many verifiable tasks
  • SFT
  • DAPO-like dynamic-sampling RL to maintain verification score diversity
  • Oversampling rollout strategy (rollout batch 3x prompt batch)
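The last two items can be combined in one step: roll out more prompts than the training batch needs, then keep only prompts whose rollouts disagree. The following is a simplified sketch of that idea, not the authors' training code; `select_training_prompts` and the data layout are hypothetical, and only the 3x oversampling ratio comes from this summary.

```python
OVERSAMPLE_FACTOR = 3  # rollout batch = 3x prompt batch, per the list above


def select_training_prompts(
    rollout_scores: dict[str, list[float]], prompt_batch_size: int
) -> list[str]:
    """Keep prompts whose rollouts received non-identical verification scores."""
    informative = [
        prompt_id
        for prompt_id, scores in rollout_scores.items()
        if max(scores) != min(scores)  # all-correct or all-wrong groups are dropped
    ]
    return informative[:prompt_batch_size]


prompt_batch_size = 2
# Scores for prompt_batch_size * OVERSAMPLE_FACTOR = 6 prompts, 4 rollouts each.
rollout_scores = {
    "p0": [1.0, 1.0, 1.0, 1.0],  # all correct: no signal
    "p1": [1.0, 0.0, 0.0, 1.0],
    "p2": [0.0, 0.0, 0.0, 0.0],  # all wrong: no signal
    "p3": [0.0, 1.0, 0.0, 0.0],
    "p4": [1.0, 1.0, 0.0, 1.0],
    "p5": [0.0, 0.0, 0.0, 0.0],
}
assert len(rollout_scores) == prompt_batch_size * OVERSAMPLE_FACTOR
batch = select_training_prompts(rollout_scores, prompt_batch_size)
```

Oversampling raises the chance that enough informative prompts survive the filter to fill a batch. The fraction of rollouts discarded this way is one plausible reading of the "rollout batch generation count" efficiency proxy listed under Metrics.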

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Automatically generated bootcamps can be overly simplified or semantically wrong; the pipeline relies on heuristic thresholds and human review (Sec.3.3).
  • Random task selection for scaling experiments can overlap conceptually with evaluation domains; authors control for data-level contamination but category overlap remains possible (Sec.4.1).
  • Reported gains are tied to verifiable tasks and the provided verification mechanism; results may not transfer to tasks that lack rule-based verifiers.

When Not To Use

  • Don't use a small number of narrow tasks (e.g., 8 tasks) for RLVR: entropy collapse and degenerate rollouts cause inefficient training (Sec.4.2, Fig.6).
  • Avoid trusting automatically generated bootcamps without execution tests and manual inspection—generation alone can produce broken or trivial tasks.

Failure Modes

  • Entropy collapse in RL rollouts when task diversity is low, producing all-correct or all-wrong responses and invalid preference data (Sec.4.2).
  • Automated bootcamp simplification where generated tasks cover only a narrow instance set; needs iterative refinement (Sec.3.3).
  • Self-consistent filtering may still pass subtle semantic errors; requires human spot-checking.

Core Entities

Models

  • DeepSeek-R1
  • DeepSeek-V3
  • DS-R1-Distilled-Qwen-32B
  • Qwen2.5-32B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen3-32B
  • Qwen3-235B-A22B

Metrics

  • verification score (0-1)
  • overall bootcamp evaluation score
  • Accuracy
  • rollout batch generation count (training efficiency proxy)

Datasets

  • BOOTCAMP-EVAL
  • INTERNBOOTCAMP task pool (1000+ tasks; 704 core used)
  • 55K synthesized long-CoT samples
  • 11K math supplementary data

Benchmarks

  • BOOTCAMP-EVAL
  • BBEH
  • KOR Bench
  • MMLU
  • AIME
  • GPQA Diamond
  • Super GPQA
  • LCB v6