Large-scale benchmark: continual pretraining helps GPT models but can harm Llama2‑7B

February 27, 2024, 8 min read

Overview

Decision Snapshot: Needs Validation

The paper provides clear large-scale empirical evidence across many domains and models, but the findings are descriptive, the experiments are compute-heavy, and the similar-order experiments rely on a single domain ordering.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

Links

Abstract / PDF / Data

Why It Matters For Business

Continual pretraining can produce better domain experts and reduce repeated retraining costs for smaller models, but it is compute-heavy and can harm larger models such as Llama2-7B unless the domain corpora are large and relevant.

Summary TLDR

The paper builds a large continual-pretraining benchmark using M2D2 (159 domains, 6.6B tokens) and tracks perplexity, transfer, and forgetting across checkpoints. Main practical findings: continual pretraining consistently improves GPT-2 family models and outperforms standalone domain-adaptive pretraining; larger models get better final perplexity and forget less; smaller models learn and forget the most; randomizing domain order reduces forgetting and improves final checkpoints; Llama2-7B degrades unless domains are large (>100 MB). The study focuses on measuring dynamics rather than proposing fixes.

Problem Statement

Continual learning for LLMs has been studied mostly in fine-tuning or small-scale settings. No large-scale, realistic benchmark has measured how incremental pretraining across many domains affects knowledge accumulation, forgetting, and downstream transfer for different model families and sizes.

Main Contribution

A large continual-pretraining benchmark using M2D2 across 159 domains (6.6B tokens) and systematic evaluation of checkpoints.

Empirical study across multiple model families (GPT-2 sizes, Llama2-7B, RoBERTa) measuring forward/backward transfer, forgetting, and downstream task impact.
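The transfer and forgetting measurements listed above can be pictured with a small perplexity matrix. The sketch below is illustrative only: the matrix layout `ppl[i][j]` (perplexity on domain j after training through domain i), the function names, and the exact formulas are assumptions based on common continual-learning conventions, not the paper's own definitions.

```python
import math
from statistics import mean


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-mean(token_logprobs))


def forgetting(ppl):
    """Average perplexity increase on earlier domains by the final checkpoint.

    Assumed layout: ppl[i][j] is the perplexity on domain j measured right
    after training on domains 0..i. Positive values mean the model got worse
    on domains it had already seen. Needs at least two domains.
    """
    last = len(ppl) - 1
    return mean(ppl[last][j] - ppl[j][j] for j in range(last))


def forward_transfer(ppl, base):
    """Average perplexity gain on a domain just before the model trains on it.

    base[j] is the untouched base model's perplexity on domain j (assumed
    input). Positive values mean earlier domains helped the unseen domain j.
    """
    return mean(base[j] - ppl[j - 1][j] for j in range(1, len(ppl)))
```

With a three-domain toy matrix, `forgetting` compares the last row against the diagonal, and `forward_transfer` compares the entries just above the diagonal against the base model. Backward transfer is often reported as the same quantity as forgetting with the opposite sign, though conventions vary.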

Key Findings

Continual pretraining (CPT) reliably improves GPT-2 family perplexity and outperforms standalone domain-adaptive pretraining (DAPT).

Numbers: Measured over 159 domains; CPT median is better than DAPT across GPT-2 sizes.

Practical Use: If you maintain a fleet of domain experts, prefer continual pretraining with checkpointing at domain shifts over separate domain-adaptive runs.

Evidence Ref: Section 4.1, Figure 3

Llama2-7B degrades with additional (continual or domain-adaptive) pretraining on small domains.

Numbers: Improvement is rare below ~75–100 MB domain size; domains need >100 MB to improve.

Practical Use: Avoid extra pretraining of Llama2-7B on small domain corpora; aggregate domains or use larger corpora before adapting.

Evidence Ref: Section 4.1, Figure 17
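The difference between the two training regimes compared in these findings can be sketched as two schedules. This is a schematic under stated assumptions, not the paper's training code: `train_on` is a hypothetical callable that returns a new model trained on one domain (without mutating its input), and the function names are made up for illustration.

```python
from typing import Callable, Dict, List, TypeVar

Model = TypeVar("Model")


def continual_pretraining(
    base: Model, domains: List[str], train_on: Callable[[Model, str], Model]
) -> Dict[str, Model]:
    """One model is carried through all domains; a checkpoint is kept at each domain shift."""
    checkpoints: Dict[str, Model] = {}
    model = base
    for name in domains:
        model = train_on(model, name)  # starts from the previous checkpoint
        checkpoints[name] = model
    return checkpoints


def domain_adaptive_pretraining(
    base: Model, domains: List[str], train_on: Callable[[Model, str], Model]
) -> Dict[str, Model]:
    """Each domain gets its own independent run that starts from the base model."""
    return {name: train_on(base, name) for name in domains}
```

In a toy setting where a "model" is just the tuple of domains it has seen, the CPT checkpoint for the third domain carries all three domains, while the DAPT model for the same domain carries only one. Comparing the two per domain is the CPT versus DAPT comparison described above.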

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
benchmark scale | 159 domains; 6.6B tokens | - | - | M2D2 (Wiki + S2ORC) | Table 1; Sections 2–3 | Table 1, Section 2
total evaluations | 12,561 checkpoint-vs-domain tests per model per setup | - | - | all domains | Discussion, Section 6 | Section 6

What To Try In 7 Days

Run short continual pretraining on a small GPT family model and checkpoint per domain to gauge CPT vs DAPT.

Randomize domain order in trials to test retention before committing to a curriculum.

Check domain sizes; avoid adapting larger foundation models such as Llama2-7B on domains smaller than 100 MB without aggregating them first.
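The last two checks can be prototyped before any training run. The sketch below is a minimal illustration, assuming you already have a mapping from domain name to corpus size in bytes; the helper names, the decimal-megabyte convention, and the choice to treat exactly 100 MB as "small" are assumptions, not details from the paper.

```python
import random

MB = 1_000_000  # assumption: decimal megabytes


def split_by_size(domain_bytes, min_mb=100):
    """Separate domains above the size threshold from those at or below it."""
    large = {d: n for d, n in domain_bytes.items() if n > min_mb * MB}
    small = {d: n for d, n in domain_bytes.items() if n <= min_mb * MB}
    return large, small


def random_order(domains, seed=0):
    """Return a reproducible random training order over the given domains."""
    order = sorted(domains)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(order)
    return order
```

The `small` group is the candidate set for aggregation before adapting a larger model, and running `random_order` with a few different seeds gives several curricula to compare for retention.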

Optimization Features

System Optimization
DeepSpeed auto configuration
Training Optimization
checkpointing at domain shifts
batch-size ablation (16 vs 64)

Risks & Boundaries

Limitations

Similar-order curriculum was constructed only once; results may vary with alternative similar curricula.

RoBERTa's pretraining data may overlap with the Wiki portion of the benchmark, possibly biasing the encoder results.

When Not To Use

Do not apply continual pretraining on a very large base model (e.g., Llama2-7B) using many small domains (<100 MB) without validation.

Do not expect continual pretraining to prevent forgetting over long horizons without additional retention strategies.

Failure Modes

Catastrophic degradation for Llama2-7B when domain corpora are small.

Late-stage overfitting and increased forgetting as continual training fills capacity.

Core Entities

Models

GPT2-S, GPT2-M, GPT2-L, GPT2-XL, Llama2-7B, RoBERTa-base, RoBERTa-large

Metrics

perplexity, forward transfer, backward transfer, forgetting (FG), normalized aggregate downstream score, prediction token rank

Datasets

M2D2, S2ORC, Wikipedia, OpenWebText

Benchmarks

BIG-Bench (Arithmetic), BIG-Bench (General Knowledge), BIG-Bench (Physics), BIG-Bench (CS Algorithms), BIG-Bench (Few-shot NLG)