Large-scale benchmark: continual pretraining helps GPT models but can harm Llama2‑7B

February 27, 2024 (8 min read)

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

Why It Matters For Business

Continual pretraining can produce better domain experts and reduce repeated retraining costs for smaller models, but it is compute-intensive and can harm larger models such as Llama2-7B unless the domain corpora are large and relevant.

Summary TLDR

The paper builds a large continual-pretraining benchmark using M2D2 (159 domains, 6.6B tokens) and tracks perplexity, transfer, and forgetting across checkpoints. Main practical findings: continual pretraining consistently improves GPT-2 family models and outperforms standalone domain-adaptive pretraining; larger models get better final perplexity and forget less; smaller models learn and forget the most; randomizing domain order reduces forgetting and improves final checkpoints; Llama2-7B degrades unless domains are large (>100 MB). The study focuses on measuring dynamics rather than proposing fixes.

Problem Statement

Continual learning for LLMs has been studied mostly in fine-tuning or small-scale settings. No large-scale, realistic benchmark measures how incremental pretraining across many domains affects knowledge accumulation, forgetting, and downstream transfer for different model families and sizes.

Main Contribution

A large continual-pretraining benchmark using M2D2 across 159 domains (6.6B tokens) and systematic evaluation of checkpoints.

Empirical study across multiple model families (GPT-2 sizes, Llama2-7B, RoBERTa) measuring forward/backward transfer, forgetting, and downstream task impact.

Practical analyses of curriculum effects, domain order (similar vs random), domain size thresholds, batch-size ablation, and a rank-based metric for knowledge accumulation.

Key Findings

Continual pretraining reliably improves GPT-2 family perplexity and outperforms standalone domain-adaptive pretraining.

Numbers: Measured over 159 domains; CPT median better than DAPT across GPT-2 sizes

Llama2-7B degrades with additional (continual or domain-adaptive) pretraining on small domains.

Numbers: Improvement rare below ~75–100 MB domain size; needs >100 MB to improve

Larger models achieve better final perplexity and generally forget less than smaller models.

Numbers: Consistent size–performance trend across metrics (DAPT, CPT, LC); trend visible in all plots

Smaller models show the biggest gains from continual pretraining but also the largest forgetting.

Numbers: Improvement inversely correlated with model size; GPT2-S > GPT2-M > GPT2-L in gains

Randomizing the order of domain training reduces forgetting and yields better final checkpoints than a single similar-order curriculum.

Numbers: Random order produced superior LC and more stable backward transfer across models

Continual pretraining improves downstream task performance when recent training domains match the task; it can hurt when domains are unrelated.

Numbers: BIG-Bench tasks show task peaks after related domain training; Llama2-7B often falls to chance

Building the benchmark was computationally heavy: many checkpoints and evaluations.

Numbers: 12,561 checkpoint-vs-domain evaluations; GPT2-S run ~6 days; Llama2-7B run ~4 months

Results

Benchmark scale

Value: 159 domains; 6.6B tokens

Total evaluations

Value: 12,561 checkpoint-vs-domain tests per model per setup

Domain-size threshold for Llama2-7B

Value: ≈75–100 MB

Example run times

Value: GPT2-S ~6 days; Llama2-7B ~4 months (single run)

What To Try In 7 Days

Run a short continual-pretraining trial on a small GPT-2 family model, checkpointing after each domain, to compare CPT against DAPT.

Randomize domain order in trials to test retention before committing to a curriculum.

Check domain sizes; avoid adapting larger foundation models (e.g., Llama2-7B) on domains under 100 MB without aggregating them first.
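
The domain-order and domain-size checks above can be sketched as a small planning helper. This is a minimal illustration, not the paper's code: the function name `plan_domain_order`, the `min_mb` default of 100, and the example domain names and sizes are all assumptions made up for the example.

```python
import random

def plan_domain_order(domain_sizes_mb, min_mb=100.0, seed=0):
    """Split domains by size and return a shuffled training order.

    domain_sizes_mb: dict mapping domain name -> corpus size in MB.
    min_mb: hypothetical cutoff; domains below it are set aside for
            aggregation or validation instead of being trained on directly.
    """
    large = sorted(d for d, mb in domain_sizes_mb.items() if mb >= min_mb)
    small = sorted(d for d, mb in domain_sizes_mb.items() if mb < min_mb)
    rng = random.Random(seed)  # local RNG so the order is reproducible per seed
    order = large[:]
    rng.shuffle(order)
    return order, small

# Made-up domain names and sizes, for illustration only.
sizes = {"domain_a": 850.0, "domain_b": 40.0, "domain_c": 120.0, "domain_d": 310.0}
order, set_aside = plan_domain_order(sizes, seed=42)
print("training order:", order)
print("set aside (<100 MB):", set_aside)
```

Running the helper with several seeds gives several random orders to compare, which also addresses the single-seed variability that affects this kind of evaluation.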

Optimization Features

System Optimization

  • DeepSpeed auto configuration

Training Optimization

  • checkpointing at domain shifts
  • batch-size ablation (16 vs 64)
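
Checkpointing at domain shifts can be sketched as a loop that saves once per finished domain. This is a framework-agnostic outline under stated assumptions: `continual_pretrain`, `train_step`, and `save_checkpoint` are hypothetical names, and the stand-in callables below replace a real model and optimizer.

```python
def continual_pretrain(domain_batches, train_step, save_checkpoint):
    """Train on domains in sequence, checkpointing at each domain shift.

    domain_batches: list of (domain_name, iterable_of_batches) in training order.
    train_step: callable(batch) that performs one optimizer update.
    save_checkpoint: callable(tag) that persists the current model state.
    """
    saved = []
    for index, (name, batches) in enumerate(domain_batches):
        for batch in batches:
            train_step(batch)
        tag = f"{index:03d}_{name}"  # checkpoint taken after finishing this domain
        save_checkpoint(tag)
        saved.append(tag)
    return saved

# Stand-in callables so the sketch runs without a model.
steps = []
tags = continual_pretrain(
    [("domain_a", [1, 2]), ("domain_b", [3])],
    train_step=steps.append,
    save_checkpoint=lambda tag: None,
)
print(tags)  # ['000_domain_a', '001_domain_b']
```

Saving one checkpoint per domain is what makes a checkpoint-vs-domain evaluation grid possible afterwards.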

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Similar-order curriculum was constructed only once; results may vary with alternative similar curricula.
  • RoBERTa may overlap with Wiki in training data, possibly biasing encoder results.
  • High compute cost: long runs and exhaustive evaluations limit reproducibility for many teams.
  • Study measures dynamics and benchmarks; it does not provide new methods to mitigate forgetting.

When Not To Use

  • Do not apply continual pretraining on a very large base model (e.g., Llama2-7B) using many small domains (<100 MB) without validation.
  • Do not expect continual pretraining to solve forgetting over long horizons without additional retention strategies.
  • Avoid using similar-order curriculum blindly when broad transfer is desired; random order can be safer.

Failure Modes

  • Catastrophic degradation for Llama2-7B when domain corpora are small.
  • Late-stage overfitting and increased forgetting as continual training fills capacity.
  • Negative forward transfer to unseen portions (e.g., Wiki) despite gains on larger portions (S2ORC).
  • Evaluation sensitivity: single random seed/order can produce variability in outcomes.

Core Entities

Models

  • GPT2-S
  • GPT2-M
  • GPT2-L
  • GPT2-XL
  • Llama2-7B
  • RoBERTa-base
  • RoBERTa-large

Metrics

  • perplexity
  • forward transfer
  • backward transfer
  • forgetting (FG)
  • normalized aggregate downstream score
  • prediction token rank
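
To make the perplexity and forgetting metrics concrete, here is a minimal sketch of how they could be computed from per-token losses and a checkpoint-by-domain perplexity matrix. The forgetting formula is one common continual-learning formulation chosen for illustration; the paper's exact definitions may differ. The function names and the 3x3 matrix are made up.

```python
import math

def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def forgetting_per_checkpoint(ppl):
    """ppl[t][d]: perplexity of the checkpoint saved after domain t, on domain d.

    Returns, for each checkpoint t >= 1, the mean perplexity increase on
    previously seen domains d < t relative to the checkpoint taken right
    after training on d. Positive values indicate forgetting.
    (Assumed definition, for illustration only.)
    """
    result = []
    for t in range(1, len(ppl)):
        deltas = [ppl[t][d] - ppl[d][d] for d in range(t)]
        result.append(sum(deltas) / len(deltas))
    return result

# Made-up 3x3 perplexity matrix, for illustration only.
ppl = [
    [20.0, 35.0, 40.0],
    [22.0, 18.0, 38.0],
    [25.0, 21.0, 15.0],
]
print(forgetting_per_checkpoint(ppl))  # [2.0, 4.0]
```

In this layout, entries above the diagonal (domains not yet trained on) are the ones a forward-transfer measure would draw on, and entries below the diagonal feed backward transfer and forgetting.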

Datasets

  • M2D2
  • S2ORC
  • Wikipedia
  • OpenWebText

Benchmarks

  • BIG-Bench (Arithmetic)
  • BIG-Bench (General Knowledge)
  • BIG-Bench (Physics)
  • BIG-Bench (CS Algorithms)
  • BIG-Bench (Few-shot NLG)