Overview
Production Readiness
0.7
Novelty Score
0.45
Cost Impact Score
0.8
Citation Count
3
Why It Matters For Business
You can update large LLMs on fresh data at far lower compute cost than full re-training while keeping model quality similar, cutting operational cost and turnaround time for model updates.
Summary TLDR
The paper shows that three simple, scalable tricks—learning-rate (LR) re-warming, LR re-decaying, and replaying a small fraction of past data—let you continually update decoder-only transformer LLMs (405M and 10B parameters) on hundreds of billions of new tokens while matching the performance of full re-training on pooled data. Re-warming improves adaptation but causes forgetting; adding modest replay (e.g., 1–5% for similar-language updates, ~25% for stronger shifts like adding German) recovers past performance. The authors also propose 'infinite' LR schedules that avoid re-warming and match cosine-decay performance in initial tests.
Problem Statement
Re-training large language models from scratch whenever new pre-training data arrives is costly. Naively continuing training can either not adapt to new data or erase previous knowledge (catastrophic forgetting). The paper asks: can simple, cheap continual pre-training rules match the performance of full re-training on the union of datasets?
Main Contribution
Show that LR re-warming + LR re-decaying is necessary to adapt LLMs when continuing pre-training from a low final LR.
Demonstrate that compute-equivalent replay (small fraction of past tokens) largely prevents forgetting while preserving adaptation.
Empirically match full re-training performance (validation loss and average evaluation scores) at 405M and 10B scales across weak (English→English) and strong (English→German) shifts, while using much less compute.
Propose 'infinite' LR schedules (constant LR phase + final anneal) as an alternative that avoids re-warming and works well in early experiments.
Key Findings
Re-warming then re-decaying the learning rate is required to adapt well to new pre-training data.
Small replay fractions substantially reduce forgetting with little hit to adaptation.
Combining LR re-warming, re-decay, and modest replay matches full re-training on pooled data in average metrics.
Excessive replay trades adaptation for retention: very large replay (50%) can hurt adapting to the new data.
Re-warming itself causes a transient loss spike even when training on the same data.
Infinite LR schedules (constant LR phase + final anneal) can match cosine-decay performance in initial experiments and remove the need to re-warm.
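The re-warm and re-decay finding can be pictured with a minimal sketch. It assumes a linear re-warm from the checkpoint's final LR followed by a cosine decay fitted to the new token budget. The function name and the defaults (max_lr, min_lr, warmup_frac) are illustrative assumptions, not values taken from this summary.

```python
import math


def rewarm_redecay_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_frac=0.01):
    """LR at `step` of a continued pre-training run lasting `total_steps`.

    All defaults are hypothetical; tune them for your own checkpoint.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Re-warm: climb linearly from min_lr (where the previous run ended) to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decay: cosine from max_lr back to min_lr over the remaining budget.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 10, 500, 1000):
        print(s, rewarm_redecay_lr(s, total_steps=1000))
```

The schedule starts at min_lr, peaks at max_lr once warmup ends, and returns to min_lr at the end of the budget, so the next update can repeat the same cycle.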
Results
Average final validation loss (avg of D0 and D1)
Forgetting reduced by small replay
Large replay trade-off
Accuracy
Who Should Care
What To Try In 7 Days
If you have a pre-trained checkpoint that ended at a low LR, re-warm the LR and re-decay it while continuing on new data.
Add compute-equivalent replay of past data; start with 1–5% for similar-language updates, test ~25% for big shifts.
Run small-scale tests: compare final validation loss and a few downstream tasks to a union-trained baseline to pick replay percent and LR max.
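To make the replay step concrete, here is a rough sketch of compute-equivalent replay under a fixed token budget. It assumes replayed tokens displace an equal number of new tokens, and that each batch is drawn from either the old or the new dataset with probability equal to the replay fraction. The helper names and the per-batch sampling scheme are assumptions for illustration, not the paper's implementation.

```python
import random


def replay_token_budget(total_tokens, replay_fraction):
    """Split a fixed token budget into (new_tokens, replay_tokens)."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    replay_tokens = int(round(total_tokens * replay_fraction))
    # Replay displaces new tokens, so total compute stays the same.
    return total_tokens - replay_tokens, replay_tokens


def mixed_source_stream(num_batches, replay_fraction, seed=0):
    """Yield 'old' or 'new' per batch; 'old' with probability replay_fraction."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield "old" if rng.random() < replay_fraction else "new"


if __name__ == "__main__":
    for frac in (0.01, 0.05, 0.25):
        new, old = replay_token_budget(200_000_000_000, frac)
        print(f"replay={frac:.0%}: new={new:,} replayed={old:,}")
```

Sweeping the fraction over a few values (for example 1%, 5%, 25%) in small-scale runs gives the candidates to compare against the union-trained baseline.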
Optimization Features
Token Efficiency
- compute-equivalent replay: replayed past tokens displace an equal number of new tokens, so the total token budget stays fixed
Training Optimization
- learning rate re-warming
- learning rate re-decaying (cosine fit to token budget)
- compute-equivalent replay (fractional past-data replay)
- infinite learning rate schedules (constant phase + final anneal)
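The "constant phase + final anneal" idea can be sketched as follows. This is a simplified illustration, assuming a linear anneal and hypothetical values for const_lr and min_lr. Any warmup or cooldown phases the paper may use before the constant phase are omitted here.

```python
def infinite_lr(step, anneal_start, anneal_steps, const_lr=1e-4, min_lr=1e-5):
    """LR for a schedule with a constant phase followed by a final anneal.

    `anneal_start` is the step at which the final anneal begins; keep it
    beyond the current step for as long as new data keeps arriving.
    All defaults are illustrative assumptions.
    """
    if step < anneal_start:
        # Constant phase: new datasets can be added here without re-warming.
        return const_lr
    # Final anneal: move linearly from const_lr down to min_lr.
    progress = min(1.0, (step - anneal_start) / max(1, anneal_steps))
    return const_lr + (min_lr - const_lr) * progress


if __name__ == "__main__":
    for s in (0, 5000, 9000, 9500, 10000):
        print(s, infinite_lr(s, anneal_start=9000, anneal_steps=1000))
```

Because the LR stays flat during the constant phase, a checkpoint taken there can continue on new data at the same LR, and the anneal is applied only when a deployable model is needed.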
Reproducibility
Code URLs
Data URLs
- Pile (Gao et al., 2020)
- SlimPajama dataset (Soboleva et al., 2023)
- OSCAR German Common Crawl (Laippala et al., 2022)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only two model sizes tested (405M, 10B); behavior at 100B+ unknown.
- German validation set was not deduplicated from German training data (possible contamination).
- Most experiments focus on two sequential datasets; many-update scenarios are less explored.
- No multi-seed runs are reported, so the numbers may reflect run-to-run variance.
- Infinite LR schedule tests limited to IID splits (no distribution shifts) at 405M scale.
When Not To Use
- When the new data is a strict per-domain sequential stream (domain-incremental) without mixing; the paper found poor results in that setting.
- When the tokenizer cannot cover the new distribution (e.g., adding a very different language) without tokenization changes.
- If you need guaranteed improvements on every downstream task—small metric differences can vary by task.
Failure Modes
- LR re-warming can cause transient spikes in past-data loss and accelerate forgetting if not paired with replay.
- Too much replay (large fraction like 50%) can hurt adaptation to new data.
- Infinite LR schedules may be suboptimal over very long continual runs or when datasets have strong shifts (not fully evaluated).
Core Entities
Models
- 405M decoder-only transformer (GPT-NeoX architecture)
- 9.6B (~10B) decoder-only transformer
Metrics
- validation loss (nats)
- Accuracy
Datasets
- Pile (English)
- SlimPajama (English subset used, deduplicated)
- German Common Crawl (OSCAR)
Benchmarks
- HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-Easy/Challenge, NaturalQuestions, TriviaQA, BoolQ, Math
- German translations for selected tasks (HellaSwag-DE, ARC-Challenge-DE, TriviaQA-DE, MMLU-DE)

