Match expensive re-training by re-warming/decaying the LR plus replay to update LLMs efficiently

March 13, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.45

Cost Impact Score

0.8

Citation Count

3

Authors

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

Links

Abstract / PDF

Why It Matters For Business

You can update large LLMs on fresh data at far lower compute cost than full re-training while keeping model quality similar, cutting operational cost and turnaround time for model updates.

Summary TLDR

The paper shows that three simple, scalable techniques, namely learning-rate (LR) re-warming, LR re-decaying, and replay of a small fraction of past data, let you continually update decoder-only transformer LLMs (405M and 10B parameters) on hundreds of billions of new tokens while matching the performance of full re-training on the pooled data. Re-warming improves adaptation but causes forgetting; adding modest replay (e.g., 1–5% for similar-language updates, ~25% for stronger shifts such as adding German) recovers past performance. The authors also propose 'infinite' LR schedules that avoid re-warming and match cosine-decay performance in initial tests.

Problem Statement

Re-training large language models from scratch whenever new pre-training data arrives is costly. Naively continuing training can either fail to adapt to the new data or erase previous knowledge (catastrophic forgetting). The paper asks: can simple, cheap continual pre-training rules match the performance of full re-training on the union of the datasets?

Main Contribution

Show that LR re-warming + LR re-decaying is necessary to adapt LLMs when continuing pre-training from a low final LR.

Demonstrate that compute-equivalent replay (small fraction of past tokens) largely prevents forgetting while preserving adaptation.

Empirically match full re-training performance (validation loss and average evaluation scores) at 405M and 10B scales across weak (English→English) and strong (English→German) shifts, while using much less compute.

Propose 'infinite' LR schedules (constant LR phase + final anneal) as an alternative that avoids re-warming and works well in early experiments.
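The re-warming and re-decaying named in the first contribution above can be illustrated with a minimal Python sketch of the LR for one continual stage: a linear re-warm followed by a cosine re-decay fitted to that stage's step budget. The function name, the warm-up length, the choice to re-warm from lr_min, and lr_min = 0.1 × lr_max are assumptions made for illustration, not settings confirmed by this summary.

```python
import math


def rewarm_redecay_lr(step, total_steps, warmup_steps, lr_max=3e-4, lr_min=3e-5):
    """LR at `step` of one continual stage (hypothetical helper).

    `step` counts from the start of the new dataset, so the schedule
    restarts each time a new stage begins.
    """
    if step < warmup_steps:
        # Linear re-warming from the low LR the previous stage ended at.
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # Cosine re-decay stretched over the remaining steps of this stage.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points of a 1,000-step stage with a 10-step re-warm.
    for s in (0, 5, 10, 500, 1000):
        print(s, rewarm_redecay_lr(s, total_steps=1000, warmup_steps=10))
```

In a real training loop the returned value would be written into the optimizer's learning rate at every step; how that is wired up depends on the framework in use.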

Key Findings

Re-warming then re-decaying the learning rate is required to adapt well to new pre-training data.

Small replay fractions substantially reduce forgetting with little hit to adaptation.

Numbers: Pile→German avg loss 2.34→1.97 with 1% replay

Combining LR re-warming, re-decay, and modest replay matches full re-training on pooled data in average metrics.

Numbers: 10B avg loss 1.89 (5% replay) vs 1.87 (union); 405M avg loss 2.37 vs 2.35

Excessive replay trades adaptation for retention: very large replay (50%) can hurt adapting to the new data.

Numbers: SlimPajama final D1 loss increases at 50% replay vs lower replay

Re-warming itself causes a transient loss spike even when training on the same data.

Numbers: Pile validation loss peak of +0.1 (η_max=3e-4) when re-warming on Pile

Infinite LR schedules (constant LR phase + final anneal) can match cosine-decay performance in initial experiments and remove the need to re-warm.

Numbers: similar final validation loss on 300B SlimPajama (405M)
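To make the shape of an 'infinite' schedule concrete, here is a minimal Python sketch with a warm-up, a cooldown to a constant LR, an open-ended constant phase, and a short final anneal. The summary only states "constant LR phase + final anneal"; the warm-up and cosine cooldown phases, the linear anneal, all phase lengths, and the values of lr_const and lr_min are assumptions for illustration.

```python
import math


def infinite_lr(step, warmup_steps, cooldown_steps, anneal_start, anneal_steps,
                lr_max=3e-4, lr_const=1.65e-4, lr_min=3e-5):
    """LR at `step` for a sketch of an 'infinite' schedule (hypothetical helper)."""
    if step < warmup_steps:
        # Initial linear warm-up from zero (assumed; done once, at the very start).
        return lr_max * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        # Cosine cooldown from lr_max to the constant LR (assumed shape).
        p = (step - warmup_steps) / cooldown_steps
        return lr_const + 0.5 * (lr_max - lr_const) * (1.0 + math.cos(math.pi * p))
    if step < anneal_start:
        # Constant phase: its length does not depend on any token budget.
        return lr_const
    # Final anneal from the constant LR down to lr_min (linear, assumed).
    p = min((step - anneal_start) / anneal_steps, 1.0)
    return lr_const + (lr_min - lr_const) * p


if __name__ == "__main__":
    for s in (0, 100, 300, 900, 1000, 1100):
        print(s, infinite_lr(s, warmup_steps=100, cooldown_steps=200,
                             anneal_start=1000, anneal_steps=100))
```

Under this sketch, a checkpoint saved during the constant phase could presumably continue on new data at lr_const with no re-warm, while the anneal is applied only to the copy that is about to be evaluated or released.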

Results

Average final validation loss (avg of D0 and D1)

Value: 1.89 at 10B (300B Pile→300B SP, 5% replay)

Baseline: 1.87 (600B union-trained)

Average final validation loss (avg of D0 and D1)

Value: 2.37 at 405M (300B Pile→300B SP, 5% replay)

Baseline: 2.35 (600B union-trained)

Forgetting reduced by small replay

Value: Pile→German avg loss 2.34 (0% replay) → 1.97 (1% replay)

Baseline: 1.75 avg (500B union-trained)

Large replay trade-off

Value: 50% replay reduces forgetting further but increases D1 loss (worse adaptation)

Baseline: lower replay percentages

Accuracy

Value: 47.68 (300B Pile→300B SP, 5% replay)

Baseline: 48.00 (600B union-trained)
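To put the reported numbers on a common scale, the short Python sketch below computes the relative gap between each continual run and its union-trained baseline. The figures are copied from the results above; the helper name and dictionary keys are illustrative only.

```python
def relative_gap(value, baseline):
    """Signed gap of `value` from `baseline`, as a fraction of the baseline."""
    return (value - baseline) / baseline


# (continual run, union-trained baseline), taken from the results above.
reported = {
    "10B avg val loss": (1.89, 1.87),    # lower is better
    "405M avg val loss": (2.37, 2.35),   # lower is better
    "avg accuracy": (47.68, 48.00),      # higher is better
}

gaps = {name: relative_gap(v, b) for name, (v, b) in reported.items()}

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name}: {gap:+.2%} relative to baseline")
```

All three gaps come out on the order of 1% of the baseline value. Since no multi-seed runs are reported, differences of this size should be read with care.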

Who Should Care

What To Try In 7 Days

If you have a pre-trained checkpoint that ended at a low LR, re-warm the LR and re-decay it while continuing on new data.

Add compute-equivalent replay of past data; start with 1–5% for similar-language updates, test ~25% for big shifts.

Run small-scale tests: compare final validation loss and a few downstream tasks to a union-trained baseline to pick replay percent and LR max.
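As a starting point for the replay step above, this Python sketch shows one way to read "compute-equivalent": replayed tokens take the place of new tokens inside a fixed budget, and each training sample is drawn from past data with probability equal to the replay fraction. The function names, per-sample random mixing, and sampling with replacement are assumptions for illustration, not the paper's data pipeline.

```python
import random


def replay_budget(total_tokens, replay_fraction):
    """Split a fixed token budget into (new_tokens, replay_tokens)."""
    replay_tokens = round(total_tokens * replay_fraction)
    return total_tokens - replay_tokens, replay_tokens


def mixed_stream(new_docs, old_docs, replay_fraction, num_samples, seed=0):
    """Return `num_samples` (source, doc) pairs mixing new and replayed data.

    Each sample comes from `old_docs` with probability `replay_fraction`,
    otherwise from `new_docs` (sampling with replacement, for simplicity).
    """
    rng = random.Random(seed)
    stream = []
    for _ in range(num_samples):
        if rng.random() < replay_fraction:
            stream.append(("old", rng.choice(old_docs)))
        else:
            stream.append(("new", rng.choice(new_docs)))
    return stream


if __name__ == "__main__":
    new_tok, old_tok = replay_budget(300_000_000_000, 0.05)
    print(f"new tokens: {new_tok:,}  replayed tokens: {old_tok:,}")
    sample = mixed_stream(["sp_doc"], ["pile_doc"], 0.05, num_samples=20)
    print([src for src, _ in sample])
```

Sweeping replay_fraction over a few values (for example 0.01, 0.05, 0.25) in small-scale runs and comparing average validation loss against a union-trained baseline is one way to choose the setting.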

Optimization Features

Token Efficiency

  • compute-equivalent replay: replayed tokens take the place of an equal number of new tokens, so the total token budget stays fixed

Training Optimization

  • learning rate re-warming
  • learning rate re-decaying (cosine fit to token budget)
  • compute-equivalent replay (fractional past-data replay)
  • infinite learning rate schedules (constant phase + final anneal)

Reproducibility

Data Urls

  • Pile (Gao et al., 2020)
  • SlimPajama dataset (Soboleva et al., 2023)
  • Oscar German CommonCrawl (Laippala et al., 2022)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only two model sizes tested (405M, 10B); behavior at 100B+ unknown.
  • German validation set was not deduplicated from German training data (possible contamination).
  • Most experiments focus on two sequential datasets; many-update scenarios are less explored.
  • No multi-seed runs reported, so the run-to-run variance of the reported numbers is unknown.
  • Infinite LR schedule tests limited to IID splits (no distribution shifts) at 405M scale.

When Not To Use

  • When the new data arrives as a strict per-domain sequential stream (domain-incremental) without mixing; the paper found poor results in that setting.
  • When the tokenizer cannot cover the new distribution (e.g., adding a very different language) without tokenization changes.
  • When you need guaranteed improvements on every downstream task; small metric differences can vary by task.

Failure Modes

  • LR re-warming can cause transient spikes in past-data loss and accelerate forgetting if not paired with replay.
  • Too much replay (large fraction like 50%) can hurt adaptation to new data.
  • Infinite LR schedules may be suboptimal over very long continual runs or when datasets have strong shifts (not fully evaluated).

Core Entities

Models

  • 405M decoder-only transformer (GPT-NeoX architecture)
  • 9.6B (~10B) decoder-only transformer

Metrics

  • validation loss (nats)
  • Accuracy

Datasets

  • Pile (English)
  • SlimPajama (English subset used, deduplicated)
  • German Common Crawl (Oscar)

Benchmarks

  • HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-Easy/Challenge, NaturalQuestions, TriviaQA, BoolQ, Math
  • German translations for selected tasks (HellaSwag-DE, ARC-Challenge-DE, TriviaQA-DE, MMLU-DE)