Survey: how to update LLMs continuously without full retraining

March 13, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

This is a comprehensive survey synthesizing many empirical studies; actionable patterns (PEFT, rehearsal, data selection) have repeated empirical support but specific gains depend on domain and setup.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/2

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin

Why It Matters For Business

Continual learning lets you update LLMs for new data or regulations with far less compute than retraining, reducing cost and speeding domain rollouts.

Summary TLDR

This survey maps continual learning (CL) methods for large language models across three stages: continual pre-training (updating base knowledge), continual fine-tuning (updating task skills), and continual alignment (updating values and preferences). It groups methods into rehearsal, regularization, architecture, data-augmentation, and process-optimization families. The paper reviews benchmarks and metrics (average performance, forgetting, forward/backward transfer) and notes concrete resource gains reported for some CL techniques (for example, FinPythia at 10% of the data cost of vanilla continual pre-training, and Llama3-Physician at 40% of training resources). It also highlights open gaps: catastrophic forgetting, limited transfer, online and multimodal CL, and evaluation blind spots.

Problem Statement

LLMs are trained once on static corpora but the world and user needs change continuously. Re-training large models from scratch is too costly. We need methods that let LLMs acquire new knowledge and preferences over time without losing old capabilities.

Main Contribution

Organizes continual learning for LLMs into three practical stages: continual pre-training, continual fine-tuning, continual alignment.

Expands canonical CL taxonomy (rehearsal, regularization, architecture) and subcategorizes methods by forgetting-mitigation mechanism.
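As one illustration of the rehearsal family named above, the sketch below keeps a small memory of earlier examples and mixes some of them into each new-task batch. It is a minimal sketch under stated assumptions, not a method taken from the survey: the class `ReservoirReplayBuffer`, the helper `mixed_batch`, and the `replay_fraction` parameter are hypothetical names, and reservoir sampling is just one common way to pick which old examples to keep.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size memory holding a uniform random sample of all examples seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Reservoir sampling: the new example replaces a stored one
            # with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k):
        # Draw up to k stored examples without replacement.
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples, buffer, replay_fraction=0.25):
    """Combine new-task examples with replayed old examples.

    replay_fraction (hypothetical knob, must be below 1) is the share of
    the final batch that comes from the replay memory.
    """
    n_replay = int(round(len(new_examples) * replay_fraction / (1.0 - replay_fraction)))
    return list(new_examples) + buffer.sample(n_replay)
```

In use, every example from an earlier task is passed to `add`, and each training step on the new task calls `mixed_batch` so the model keeps seeing a slice of old data. The right replay share is an empirical choice that depends on the task stream.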

Key Findings

Some data-augmentation and selection strategies dramatically cut pretraining cost.

Numbers: FinPythia needed 10% of the data cost of vanilla CPT

Practical Use: If you need a domain LLM quickly, try targeted data selection/augmentation to capture most of the gains at ~10% of the data cost.

Evidence Ref: Xie et al. [36]
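To make "targeted data selection" concrete, here is a rough sketch that ranks candidate documents by similarity to a few task seed texts and keeps the top 10%. It is an assumption-laden illustration, not the selection procedure used for FinPythia: the bag-of-words cosine score, the function names, and the `fraction` parameter are all hypothetical stand-ins for whatever relevance signal you actually trust.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    # Lowercased token counts; a deliberately crude text representation.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_top_fraction(corpus, task_seed_texts, fraction=0.10):
    """Return the `fraction` of documents most similar to the task seed texts."""
    seed_vec = bag_of_words(" ".join(task_seed_texts))
    ranked = sorted(corpus, key=lambda doc: cosine(bag_of_words(doc), seed_vec), reverse=True)
    keep = max(1, round(len(corpus) * fraction))
    return ranked[:keep]
```

A call such as `select_top_fraction(docs, ["bond yield earnings forecast"])` returns the documents closest to the seed vocabulary. In practice an embedding model or a perplexity-based score would likely replace the bag-of-words similarity.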

Domain continual pre-training can match or beat stronger models with less compute.

Numbers: Llama3-Physician used 40% of training resources and outperformed GPT-4 on several medical benchmarks

Practical Use: For domain adaptation, invest in domain-specific continual pre-training plus quality data instead of full retraining.

Evidence Ref: Guo et al. [37]

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Data cost for continual pre-training | 10% of vanilla CPT data | Vanilla continual pre-training | ≈10x less data | FinPythia / financial domain | FinPythia achieves comparable performance using efficient data selection | Xie et al. [36] |
| Training resources for domain model | 40% of typical resources | Comparable GPT-4 domain performance | Uses 60% less compute | Medical domain (Llama3-Physician) | Llama3-Physician outperforms GPT-4 on several medical benchmarks with 40% training resources | Guo et al. [37] |
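The two Delta entries use different conventions (a reduction factor in the first row, a percentage saving in the second). As a quick check, assume $f$ denotes the fraction of the baseline cost that the method uses. The symbol $f$ is introduced here only for illustration.

```latex
\text{reduction factor} = \frac{1}{f}, \qquad \text{saving} = 1 - f

f = 0.10:\quad \frac{1}{0.10} = 10\times \text{ less data}, \qquad 1 - 0.10 = 90\% \text{ saving}

f = 0.40:\quad \frac{1}{0.40} = 2.5\times \text{ less compute}, \qquad 1 - 0.40 = 60\% \text{ saving}
```

Under that reading, both rows are internally consistent with their Value column.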

What To Try In 7 Days

Measure forgetting: run AP and Forgetting Rate on your task stream to get a baseline.

Apply LoRA or other PEFT adapters for a new task to avoid full-model retraining.

Try small-scale continual pretraining with curated domain data (10% focused selection) to test cost vs gain.
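For the first step above, a baseline can be computed from an accuracy matrix in which `acc[i][j]` is the accuracy on task `j` after training through task `i`. The sketch below uses definitions that are common in the continual learning literature. The survey's exact formulas may differ, so treat the function names and formulas as assumptions to check against the paper.

```python
def average_performance(acc):
    """Mean accuracy over all tasks after training on the final task."""
    final = acc[-1]
    return sum(final) / len(final)


def forgetting_rate(acc):
    """Mean drop from the best earlier accuracy on each old task to its final accuracy."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any stage from learning it up to the second-to-last stage.
        best_before = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_before - acc[-1][j])
    return sum(drops) / len(drops)


def backward_transfer(acc):
    """Mean change on each old task between just after learning it and the end of the stream."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    return sum(acc[-1][j] - acc[j][j] for j in range(num_tasks - 1)) / (num_tasks - 1)


# Illustrative numbers only (not from the survey): three tasks learned in sequence.
example_acc = [
    [0.90, 0.10, 0.05],
    [0.80, 0.85, 0.10],
    [0.70, 0.75, 0.88],
]
```

On `example_acc`, forgetting is positive and backward transfer is negative, which is the typical signature of a model losing earlier skills. Re-running the same functions after adding rehearsal or adapters shows whether the mitigation helped.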

Optimization Features

Token Efficiency
instruction-synthesized data to improve signal per token
Infra Optimization
mixing small curated corpora rather than full re-pretraining
Model Optimization
MoELoRA
System Optimization
freeze-base + train small modules to cut compute
Training Optimization
data selection/augmentation (ETS-DACP, ETA-DACP); pre-instruction tuning (PIT) to reduce instability; process optimization to decouple format alignment from knowledge
Inference Optimization
adapter routing and memory composition (routing networks, switch models)
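The "freeze-base + train small modules" idea can be sketched with a plain LoRA-style layer: the base weight `W` stays fixed and only two small matrices `A` and `B` are trained. This is a dependency-free illustration of basic LoRA, not of MoELoRA or of any specific library's API. The helper names and the use of nested lists instead of tensors are assumptions made to keep the example self-contained.

```python
def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector (list of numbers).
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen base weight; A (r x d_in) and
    B (d_out x r) are the only trainable parameters.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_out, d_in, r):
    """Share of parameters trained by the adapter relative to the full weight matrix."""
    return r * (d_in + d_out) / (d_out * d_in)
```

With `B` initialized to zeros the layer reproduces the frozen base output exactly, so training starts from the original model's behavior. For a 4096 x 4096 weight and rank 8, `trainable_fraction` gives 1/256 of the full matrix, which is where the compute saving comes from.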

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey compiles existing studies but does not provide new unified benchmarks or code.

Some cited results (resource reductions, benchmark wins) come from individual papers and may not generalize.

When Not To Use

If you require provable, worst-case guarantees on forgetting (CL methods are largely empirical).

If you can afford full re-pretraining and need maximal global consistency.

Failure Modes

Catastrophic forgetting still occurs under some CPT and fine-tuning schedules.

Poorly curated replay or synthetic samples can induce distributional bias.

Core Entities

Models

LLaMA, LLaMA2, LLaMA3, GPT-4, Claude, Qwen, Pythia, GLaM, RoBERTa, BERT, T5, BLOOMZ, Mistral-7B

Metrics

Average Performance (AP), Forgetting Rate, Forward Transfer Rate (FWT), Backward Transfer Rate (BWT), FUAR

Datasets

Common Crawl, MMLU, GSM8K, domain corpora (medical, legal, financial)

Benchmarks

TRACE, CITB, InstrDialog / InstrDialog++, InvariantLama / UpdatedLama / NewLama, StreamBench, long-sequence 15-task benchmark (Razdaibiedina et al.)

Context Entities

Models

Mix-CPT, Llama3-Physician, FinPythia-6.9B

Metrics

training data cost, training resource fraction

Datasets

reading-comprehension converted corpora, instruction-following seeds

Benchmarks

TRACE, CITB, StreamBench