Large-scale benchmark: continual pretraining helps GPT models but can harm Llama2‑7B

Overview

Decision Snapshot: Needs Validation

The paper provides clear large-scale empirical evidence across many domains and models, but its findings are descriptive, the experiments are compute-heavy, and the similar-order experiments rely on a single domain ordering.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

Links

Abstract / PDF / Data

Why It Matters For Business

Continual pretraining can produce better domain experts and reduce repeated retraining costs for smaller models, but it is compute-heavy and can harm very large models (e.g., Llama2-7B) unless domain corpora are large and relevant.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The paper builds a large continual-pretraining benchmark using M2D2 (159 domains, 6.6B tokens) and tracks perplexity, transfer, and forgetting across checkpoints. Main practical findings: continual pretraining consistently improves GPT-2 family models and outperforms standalone domain-adaptive pretraining; larger models get better final perplexity and forget less; smaller models learn and forget the most; randomizing domain order reduces forgetting and improves final checkpoints; Llama2-7B degrades unless domains are large (>100 MB). The study focuses on measuring dynamics rather than proposing fixes.

Problem Statement

Continual learning for LLMs has been studied mostly on fine-tuning or small-scale settings. There is no large-scale, realistic benchmark that measures how incremental pretraining across many domains affects knowledge accumulation, forgetting, and downstream transfer for different model families and sizes.

Main Contribution

A large continual-pretraining benchmark using M2D2 across 159 domains (6.6B tokens) and systematic evaluation of checkpoints.

Empirical study across multiple model families (GPT-2 sizes, Llama2-7B, RoBERTa) measuring forward/backward transfer, forgetting, and downstream task impact.

Key Findings

Continual pretraining reliably improves GPT-2 family perplexity and outperforms standalone domain-adaptive pretraining.

Numbers: Measured over 159 domains; CPT median better than DAPT across GPT2 sizes

Practical Use: If you maintain a fleet of domain experts, prefer continual pretraining with checkpointing at domain shifts over separate domain-adaptive runs.

Evidence Ref: Section 4.1, Figure 3
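The difference between the two training regimes compared in this finding can be sketched as a checkpoint lineage. This is a schematic only: the "model" is a tuple of domain names rather than real weights, `train_on` is a hypothetical stand-in for a pretraining run, and the domain names are illustrative, not taken from M2D2.

```python
from typing import Dict, List, Tuple

# A "model" here is just the tuple of domains it has seen, starting from a
# shared base checkpoint. It is a stand-in for real weights.
Model = Tuple[str, ...]


def train_on(model: Model, domain: str) -> Model:
    """Hypothetical stand-in for one pretraining run on a single domain."""
    return model + (domain,)


def continual_pretraining(base: Model, domains: List[str]) -> Dict[str, Model]:
    """Train on domains in sequence, saving a checkpoint at each domain shift."""
    checkpoints: Dict[str, Model] = {}
    model = base
    for domain in domains:
        model = train_on(model, domain)  # continues from the previous checkpoint
        checkpoints[domain] = model
    return checkpoints


def domain_adaptive_pretraining(base: Model, domains: List[str]) -> Dict[str, Model]:
    """Train one separate model per domain, each starting from the base."""
    return {domain: train_on(base, domain) for domain in domains}


if __name__ == "__main__":
    base: Model = ("base",)
    domains = ["physics", "cs", "history"]  # illustrative names only
    print(continual_pretraining(base, domains)["history"])
    print(domain_adaptive_pretraining(base, domains)["history"])
```

In the continual run, the checkpoint saved after the last domain has passed through every earlier domain, while each domain-adaptive model has seen only the base data and its own domain.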

Llama2-7B degrades with additional (continual or domain-adaptive) pretraining on small domains.

Numbers: Improvement rare below ~75–100 MB domain size; needs >100 MB to improve

Practical Use: Avoid extra pretraining of Llama2-7B on small domain corpora; aggregate domains or use larger corpora before adapting.

Evidence Ref: Section 4.1, Figure 17

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
benchmark scale	159 domains; 6.6B tokens	—	—	M2D2 (Wiki + S2ORC)	Table 1; Sections 2–3	Table 1, Section 2
total evaluations	12,561 checkpoint-vs-domain tests per model per setup	—	—	all domains	Discussion, Section 6	Section 6

What To Try In 7 Days

Run short continual pretraining on a small GPT family model and checkpoint per domain to gauge CPT vs DAPT.

Randomize domain order in trials to test retention before committing to a curriculum.

Check domain sizes; avoid adapting very large foundation models on domains under 100 MB without aggregation.
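The last two checks above can be prepared before any training run. The following is a minimal sketch, assuming you already have a mapping from domain name to corpus size in megabytes. The function names, the example sizes, and the choice to treat a domain of exactly 100 MB as small are assumptions for illustration, not part of the paper.

```python
import random
from typing import Dict, List, Tuple

SIZE_THRESHOLD_MB = 100.0  # rule of thumb from the summary (">100 MB")


def split_by_size(
    sizes_mb: Dict[str, float], threshold_mb: float = SIZE_THRESHOLD_MB
) -> Tuple[List[str], List[str]]:
    """Return (large, small) domain names, each sorted for determinism.

    Domains at or below the threshold are treated as small (an assumption).
    """
    large = sorted(d for d, s in sizes_mb.items() if s > threshold_mb)
    small = sorted(d for d, s in sizes_mb.items() if s <= threshold_mb)
    return large, small


def random_order(domains: List[str], seed: int) -> List[str]:
    """Return a shuffled copy of the domain list, reproducible via the seed."""
    rng = random.Random(seed)
    shuffled = list(domains)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Made-up sizes for illustration only.
    sizes = {"domain_a": 250.0, "domain_b": 40.0, "domain_c": 120.0, "domain_d": 8.5}
    large, small = split_by_size(sizes)
    print("candidates for adaptation:", large)
    print("aggregate or skip:", small)
    print("trial order:", random_order(list(sizes), seed=0))
```

Fixing the seed keeps a randomized trial reproducible, so the same order can be compared against a curriculum-style ordering later.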

Optimization Features

System Optimization

DeepSpeed auto configuration

Training Optimization

checkpointing at domain shifts, batch-size ablation (16 vs 64)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://arxiv.org/abs/2210.07370 (M2D2)

https://github.com/Skylion007/OpenWebTextCorpus (OpenWebText)

https://github.com/allenai/s2orc (S2ORC)

Risks & Boundaries

Limitations

Similar-order curriculum was constructed only once; results may vary with alternative similar curricula.

RoBERTa may overlap with Wiki in training data, possibly biasing encoder results.

When Not To Use

Do not apply continual pretraining on a very large base model (e.g., Llama2-7B) using many small domains (<100 MB) without validation.

Do not expect continual pretraining to solve forgetting without additional retention strategies on long horizons.

Failure Modes

Catastrophic degradation for Llama2-7B when domain corpora are small.

Late-stage overfitting and increased forgetting as continual training fills capacity.

Core Entities

Models

GPT2-S, GPT2-M, GPT2-L, GPT2-XL, Llama2-7B, RoBERTa-base, RoBERTa-large

Metrics

perplexity, forward transfer, backward transfer, forgetting (FG), normalized aggregate downstream score, prediction token rank
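Two of these metrics can be illustrated with a short sketch. Perplexity below uses the standard definition (the exponential of the mean negative log-likelihood per token). The forgetting function is one common formulation and is an assumption here: the summary does not give the paper's exact definition of FG. The table layout `ppl[i][j]` (checkpoint `i` evaluated on domain `j`) is also assumed.

```python
import math
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def forgetting(ppl: List[List[float]]) -> List[float]:
    """Per-domain forgetting under an assumed definition.

    ppl[i][j] is the perplexity of checkpoint i (saved right after training
    on domain i) evaluated on domain j. For every domain j before the last,
    return the final checkpoint's perplexity on j minus the perplexity
    measured right after j was learned. Positive values mean perplexity rose.
    """
    last = len(ppl) - 1
    return [ppl[last][j] - ppl[j][j] for j in range(last)]


if __name__ == "__main__":
    # Four tokens, each predicted with probability 0.5.
    print(perplexity([math.log(0.5)] * 4))
    # Made-up 2x2 perplexity table for illustration only.
    print(forgetting([[10.0, 20.0], [12.0, 8.0]]))
```

Backward transfer can be read off the same table by comparing later checkpoints against earlier domains; the exact aggregation used in the paper may differ from this sketch.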

Datasets

M2D2, S2ORC, Wikipedia, OpenWebText

Benchmarks

BIG-Bench (Arithmetic), BIG-Bench (General Knowledge), BIG-Bench (Physics), BIG-Bench (CS Algorithms), BIG-Bench (Few-shot NLG)
