A 1,000+ task environment and benchmark that shows training on many verifiable tasks boosts LLM reasoning and efficiency

Overview

Decision Snapshot: Ready For Pilot

The paper provides code and concrete experiments showing task-scaling gains. Results are strong on the authors' own benchmark and on several out-of-distribution (OOD) tests, but replicating the full method requires substantial compute and careful bootcamp filtering. The evidence spans multiple tables and figures but has not yet been broadly replicated externally.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Training on many verifiable tasks yields broader reasoning, faster RL training, and better generalization—useful for robust assistants, automated QA, and data-synthesis pipelines.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager, CTO

Summary TLDR

INTERNBOOTCAMP is an open-source library of 1000+ verifiable reasoning tasks and a 9,232-sample benchmark (BOOTCAMP-EVAL). The authors show that training LLMs on many verifiable tasks (task scaling) improves reasoning performance and RL training efficiency. They release code and data and demonstrate large gains when combining supervised fine-tuning (SFT) and reinforcement learning (RL) on synthesized long chain-of-thought data.

Problem Statement

Current RL efforts for LLM reasoning focus on narrow domains (math/code). Real-world reasoning needs cross-domain, verifiable tasks and scalable task generation. Building large, verifiable task libraries by hand is impractical, and it's unclear how increasing the number of training tasks affects reasoning generalization and RL training efficiency.

Main Contribution

INTERNBOOTCAMP: open-source library with 1000+ verifiable reasoning task classes and unified interfaces to generate problems and verify solutions.

BOOTCAMP-EVAL: a cross-domain benchmark (9,232 samples across 118 human-curated tasks) for measuring reasoning generalization.
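To make the "unified interfaces to generate problems and verify solutions" idea concrete, here is a minimal sketch of what one verifiable task class could look like. Everything in it is an assumption for illustration: the class name ToyAdditionBootcamp, the method names generate_case and make_prompt, and the [answer]...[/answer] output format are hypothetical and are not taken from the INTERNBOOTCAMP code. Only the name verify_score appears in this summary, and its real signature may differ.

```python
import random
import re


class ToyAdditionBootcamp:
    """Hypothetical task class; names and signatures are illustrative only."""

    def __init__(self, max_operand: int = 99, seed: int = 0):
        self.max_operand = max_operand
        self.rng = random.Random(seed)

    def generate_case(self) -> dict:
        # Sample a problem instance; the ground truth is recomputable from it.
        return {
            "a": self.rng.randint(0, self.max_operand),
            "b": self.rng.randint(0, self.max_operand),
        }

    def make_prompt(self, case: dict) -> str:
        # Render the instance as a natural-language question.
        return (
            f"What is {case['a']} + {case['b']}? "
            "Give the final answer as [answer]N[/answer]."
        )

    def verify_score(self, response: str, case: dict) -> float:
        # Rule-based check: take the last tagged answer and compare it
        # with the recomputed ground truth. Returns a score in [0, 1].
        matches = re.findall(r"\[answer\]\s*(-?\d+)\s*\[/answer\]", response)
        if not matches:
            return 0.0
        return 1.0 if int(matches[-1]) == case["a"] + case["b"] else 0.0


task = ToyAdditionBootcamp(seed=42)
case = task.generate_case()
prompt = task.make_prompt(case)
correct = f"Some reasoning... [answer]{case['a'] + case['b']}[/answer]"
print(task.verify_score(correct, case))  # 1.0
```

The point of the design is that generation and verification share the same instance description, so a task can produce unlimited training problems and score any model response without human labels.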

Key Findings

INTERNBOOTCAMP supports 1000+ reasoning task classes and the authors used a core set of 704 tasks for experiments.

Numbers: 1000+ tasks total; 704 tasks retained for experiments

Practical Use: Use the library to train or synthesize data across many task types; the published core set (704) is ready for scaling experiments.

Evidence Ref: Sec.3.1–3.4; Sec.3.3

BOOTCAMP-EVAL is a verifiable evaluation suite of 9,232 examples spanning 118 tasks.

Numbers: 9,232 samples across 118 tasks

Practical Use: Use BOOTCAMP-EVAL to measure cross-domain reasoning generalization without train-test contamination.

Evidence Ref: Table 4; Sec.3.4
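With 9,232 samples spread over 118 tasks, how scores are aggregated affects the headline number. The sketch below shows two common choices, a per-sample (micro) average and a per-task (macro) average. This summary does not say which one the authors report, so both are computed here as an assumption; the task names and the summarize function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize(records: List[Tuple[str, float]]) -> Tuple[Dict[str, float], float, float]:
    """records holds (task_name, verification_score) pairs with scores in [0, 1]."""
    per_task = defaultdict(list)
    for task_name, score in records:
        per_task[task_name].append(score)

    # Mean score inside each task.
    task_means = {name: sum(v) / len(v) for name, v in per_task.items()}
    # Micro average: every sample counts equally, so large tasks dominate.
    micro = sum(score for _, score in records) / len(records)
    # Macro average: every task counts equally, regardless of its size.
    macro = sum(task_means.values()) / len(task_means)
    return task_means, micro, macro


# Hypothetical task names and scores, for illustration only.
records = [("sudoku", 1.0), ("sudoku", 0.0), ("sudoku", 0.0), ("maze", 1.0)]
task_means, micro, macro = summarize(records)
print(micro)  # 0.5
```

In this toy input the macro average (about 0.67) is higher than the micro average (0.5) because the small task is solved and the large one mostly is not, which is why it helps to report per-task scores alongside any single overall number.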

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BOOTCAMP-EVAL size	9,232 samples across 118 tasks	—	—	BOOTCAMP-EVAL	Table 4; Sec.3.4	Table 4
Core tasks used in experiments	704 tasks retained for task-scaling experiments	—	—	INTERNBOOTCAMP pool	Sec.3.3 (after filtering and deduplication)	Sec.3.3

What To Try In 7 Days

Clone INTERNBOOTCAMP and run BOOTCAMP-EVAL on your model to get a cross-domain baseline.

Generate a small synthetic dataset (few thousand long-CoT samples) from 50–200 bootcamp tasks and fine-tune (SFT) one model.

Follow SFT with a short RL pass (DAPO/GRPO) using verify_score as the reward; compare against SFT-only and RL-only baselines over 300 steps or fewer to spot efficiency gains.
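For the RL step, a GRPO-style trainer turns the verification scores of several rollouts for the same prompt into group-relative advantages. The sketch below shows only that normalization, under stated assumptions: the function name is hypothetical, the population standard deviation is used (implementations differ on this), and clipping, KL terms, and token-level weighting are omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one prompt's rollout scores to zero mean and unit scale."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps avoids division by zero when every rollout gets the same score.
    return [(s - mu) / (sigma + eps) for s in scores]


# Mixed outcomes: correct rollouts get positive advantage, wrong ones negative.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# Identical outcomes: every advantage is zero, so the group adds no gradient.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

The second call illustrates why groups that are all-correct or all-wrong waste rollout compute, which is the efficiency effect worth tracking when comparing runs.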

Optimization Features

Training Optimization

Task scaling: multi-task RL on many verifiable tasks

SFT

DAPO-like dynamic-sampling RL to maintain verification score diversity

Oversampling rollout strategy (rollout batch 3x prompt batch)
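The dynamic-sampling and oversampling items can be sketched as a filter over rollout groups: sample more prompts than the training batch needs, and keep only groups whose verification scores are not all identical. This is one reading of "rollout batch 3x prompt batch", not the authors' implementation; the function names, the rollout_scores callback, and the early-stop rule are all assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def is_informative(scores: List[float]) -> bool:
    # A group whose rollouts all share one score (all-correct or all-wrong)
    # has zero group-relative advantage and carries no learning signal.
    return max(scores) > min(scores)


def fill_prompt_batch(
    prompts: Iterable[str],
    rollout_scores: Callable[[str], List[float]],
    prompt_batch_size: int,
    oversample: int = 3,
) -> Tuple[List[Tuple[str, List[float]]], int]:
    """Return kept (prompt, scores) groups and the number of prompts sampled."""
    kept: List[Tuple[str, List[float]]] = []
    sampled = 0
    budget = oversample * prompt_batch_size  # cap on prompts rolled out
    for prompt in prompts:
        if sampled >= budget or len(kept) >= prompt_batch_size:
            break
        scores = rollout_scores(prompt)
        sampled += 1
        if is_informative(scores):
            kept.append((prompt, scores))
    return kept, sampled


# Hypothetical precomputed verification scores standing in for real rollouts.
fake_scores = {
    "p1": [1.0, 1.0, 1.0, 1.0],
    "p2": [0.0, 1.0, 0.0, 1.0],
    "p3": [0.0, 0.0, 0.0, 0.0],
    "p4": [1.0, 0.0, 0.0, 0.0],
    "p5": [1.0, 1.0, 0.0, 0.0],
}
kept, sampled = fill_prompt_batch(list(fake_scores), fake_scores.__getitem__, prompt_batch_size=2)
print([p for p, _ in kept], sampled)  # ['p2', 'p4'] 4
```

The sampled count is a rough stand-in for the rollout generation count listed under Metrics: the more degenerate groups a run produces, the more prompts it must roll out to fill one training batch.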

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/InternLM/InternBootcamp

Data URLs

https://github.com/InternLM/InternBootcamp

Risks & Boundaries

Limitations

Automatically generated bootcamps can be overly simplified or semantically wrong; the pipeline relies on heuristic thresholds and human review (Sec.3.3).

Random task selection for scaling experiments can overlap conceptually with evaluation domains; authors control for data-level contamination but category overlap remains possible (Sec.4.1).

When Not To Use

Don't use a small number of narrow tasks (e.g., 8 tasks) for RLVR: entropy collapse and degenerate rollouts cause inefficient training (Sec.4.2, Fig.6).

Avoid trusting automatically generated bootcamps without execution tests and manual inspection—generation alone can produce broken or trivial tasks.

Failure Modes

Entropy collapse in RL rollouts when task diversity is low, producing all-correct or all-wrong responses and invalid preference data (Sec.4.2).

Automated bootcamp simplification where generated tasks cover only a narrow instance set; needs iterative refinement (Sec.3.3).

Core Entities

Models

DeepSeek-R1, DeepSeek-V3, DS-R1-Distilled-Qwen-32B, Qwen2.5-32B-Instruct, Qwen2.5-7B-Instruct, Qwen3-32B, Qwen3-235B-A22B

Metrics

verification score (0-1), overall bootcamp evaluation score, accuracy, rollout batch generation count (training efficiency proxy)

Datasets

BOOTCAMP-EVAL, INTERNBOOTCAMP task pool (1000+ tasks; 704 core used), 55K synthesized long-CoT samples, 11K math supplementary data

Benchmarks

BOOTCAMP-EVAL, BBEH, KOR Bench, MMLU, AIME, GPQA Diamond, Super GPQA, LCB v6

