Survey: how to update LLMs continuously without full retraining

Overview

Decision Snapshot: Ready For Pilot

This is a comprehensive survey synthesizing many empirical studies; actionable patterns (PEFT, rehearsal, data selection) have repeated empirical support but specific gains depend on domain and setup.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/2

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin

Why It Matters For Business

Continual learning lets you update LLMs for new data or regulations with far less compute than retraining, reducing cost and speeding domain rollouts.

Who Should Care

CTO, Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

This survey maps continual learning (CL) methods for large language models across three stages: continual pre-training (updating base knowledge), continual fine-tuning (updating task skills), and continual alignment (updating values/preferences). It groups methods into rehearsal, regularization, architecture, data-augmentation, and process-optimization families. The paper reviews benchmarks and metrics (average performance, forgetting, forward/backward transfer) and notes concrete resource gains reported for some CL techniques: FinPythia at 10% of the data cost of vanilla continual pre-training, and Llama3-Physician at 40% of training resources. It also highlights open gaps: catastrophic forgetting, limited transfer, online/multimodal CL, and evaluation blind spots.

Problem Statement

LLMs are trained once on static corpora but the world and user needs change continuously. Re-training large models from scratch is too costly. We need methods that let LLMs acquire new knowledge and preferences over time without losing old capabilities.

Main Contribution

Organizes continual learning for LLMs into three practical stages: continual pre-training, continual fine-tuning, continual alignment.

Expands canonical CL taxonomy (rehearsal, regularization, architecture) and subcategorizes methods by forgetting-mitigation mechanism.

Key Findings

Some data-augmentation and selection strategies dramatically cut pretraining cost.

Numbers: FinPythia, 10% of data cost vs vanilla CPT

Practical Use: If you need a domain LLM quickly, try targeted data selection/augmentation to get most gains with ~10% of data cost.

Evidence Ref: Xie et al. [36]

Domain continual pre-training can match or beat stronger models with less compute.

Numbers: Llama3-Physician used 40% of training resources and outperformed GPT-4 on several medical benchmarks

Practical Use: For domain adaptation, invest in domain-specific continual pre-training plus quality data instead of full retraining.

Evidence Ref: Guo et al. [37]
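To make the "targeted data selection" advice concrete, here is a minimal sketch of keeping only the most domain-relevant slice of a corpus. It is not the selection method used by FinPythia or any cited paper: the scoring rule (share of tokens found in a hand-picked domain vocabulary), the function names, and the 10% default are all illustrative assumptions.

```python
import math

def domain_score(doc, domain_terms):
    """Fraction of whitespace-separated tokens that appear in domain_terms.

    Hypothetical scoring rule for illustration only; real pipelines
    typically use perplexity, embeddings, or classifier scores.
    """
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(t in domain_terms for t in tokens) / len(tokens)

def select_top_fraction(docs, domain_terms, fraction=0.10):
    """Return the highest-scoring `fraction` of docs (at least one doc)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return []
    k = max(1, math.ceil(fraction * len(docs)))
    # sorted() is stable, so ties keep their original corpus order.
    ranked = sorted(docs, key=lambda d: domain_score(d, domain_terms), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    terms = {"bond", "yield", "coupon", "maturity", "equity"}
    corpus = ["bond yield coupon maturity"] + ["the cat sat on the mat"] * 9
    print(select_top_fraction(corpus, terms))
```

The selected subset would then feed a small continual pre-training run, so cost versus gain can be compared against training on the full corpus.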

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
data cost for continual pre-training	10% of vanilla CPT data	vanilla continual pre-training	≈10x less data	FinPythia / financial domain	FinPythia achieves comparable performance using efficient data selection	Xie et al. [36]
training resources for domain model	40% of typical resources	comparable GPT-4 domain performance	uses 60% less compute	medical domain (Llama3-Physician)	Llama3-Physician outperforms GPT-4 on several medical benchmarks with 40% training resources	Guo et al. [37]

What To Try In 7 Days

Measure forgetting: run AP and Forgetting Rate on your task stream to get a baseline.

Apply LoRA or other PEFT adapters for a new task to avoid full-model retraining.

Try small-scale continual pretraining with curated domain data (10% focused selection) to test cost vs gain.
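The compute argument for LoRA-style adapters can be checked with a short sketch. It assumes the standard low-rank formulation, where a frozen weight matrix W (d_out x d_in) is augmented by a trainable product B·A scaled by alpha / r. The helper names and the example dimensions are illustrative, not taken from the survey.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W: d_out x d_in frozen base weights
    A: r x d_in trainable down-projection
    B: d_out x r trainable up-projection (commonly initialized to zeros)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in the adapter: A has r*d_in, B has d_out*r."""
    return rank * (d_in + d_out)

if __name__ == "__main__":
    d_in = d_out = 4096  # illustrative layer size, an assumption
    rank = 8             # illustrative rank, an assumption
    full = d_in * d_out
    adapter = lora_trainable_params(d_in, d_out, rank)
    print(f"full: {full}, adapter: {adapter}, fraction: {adapter / full:.4%}")
```

With B initialized to zeros the adapted layer reproduces the base layer exactly, so training starts from the original model's behavior and only the small A and B matrices receive gradients.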

Optimization Features

Token Efficiency

instruction-synthesized data to improve signal per token

Infra Optimization

mixing small curated corpora rather than full re-pretraining

Model Optimization

MoELoRA

System Optimization

freeze-base + train small modules to cut compute

Training Optimization

data selection/augmentation (ETS-DACP, ETA-DACP)

pre-instruction tuning (PIT) to reduce instability

process optimization to decouple format alignment from knowledge

Inference Optimization

adapter routing and memory composition (routing networks, switch models)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey compiles existing studies but does not provide new unified benchmarks or code.

Some cited results (resource reductions, benchmark wins) come from individual papers and may not generalize.

When Not To Use

If you require provable, worst-case guarantees on forgetting, since CL methods are largely empirical.

If you can afford full re-pretraining and need maximal global consistency.

Failure Modes

Catastrophic forgetting still occurs under some CPT and fine-tuning schedules.

Poorly curated replay or synthetic samples can induce distributional bias.
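One common way to apply rehearsal is to mix a fixed share of stored old-task examples into every new-task batch. The sketch below illustrates that idea only; the function name, the 20% default replay ratio, and the uniform sampling are assumptions, not values or procedures reported in the survey. Uniform sampling from a poorly curated buffer is exactly where the distributional bias noted above can enter.

```python
import random

def mixed_batch(new_examples, replay_buffer, batch_size, replay_ratio=0.2, rng=None):
    """Build one training batch that blends new-task and replayed examples.

    replay_ratio is the target share of the batch drawn from replay_buffer;
    it is capped by how many replay examples are actually available.
    """
    if not 0 <= replay_ratio <= 1:
        raise ValueError("replay_ratio must be in [0, 1]")
    rng = rng or random.Random()
    n_replay = min(len(replay_buffer), round(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    if n_new > len(new_examples):
        raise ValueError("not enough new examples to fill the batch")
    # Sample without replacement from each pool, then shuffle so that
    # replayed items are not clustered at the end of the batch.
    batch = rng.sample(new_examples, n_new) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(batch)
    return batch

if __name__ == "__main__":
    new = [("new", i) for i in range(50)]
    old = [("old", i) for i in range(50)]
    batch = mixed_batch(new, old, batch_size=10, rng=random.Random(0))
    print(sum(1 for tag, _ in batch if tag == "old"), "replayed of", len(batch))
```

Tracking the forgetting metrics while varying `replay_ratio` gives a quick read on how much replay a given task stream needs.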

Core Entities

Models

LLaMA, LLaMA2, LLaMA3, GPT-4, Claude, Qwen, Pythia, GLaM, RoBERTa, BERT, T5, BLOOMZ, Mistral-7B

Metrics

Average Performance (AP), Forgetting Rate, Forward Transfer Rate (FWT), Backward Transfer Rate (BWT), FUAR
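As a rough illustration of how the first four metrics can be computed, the sketch below uses definitions common in the continual-learning literature; the survey's exact formulas may differ, and FUAR is not covered. It assumes a square matrix `R` where `R[i][j]` is the score on task `j` after training on tasks `0..i`, plus an optional `baseline` vector of scores for the model before any continual training. Both names are illustrative.

```python
from statistics import mean

def continual_metrics(R, baseline=None):
    """Compute AP, forgetting, BWT, and optionally FWT from a T x T score matrix."""
    T = len(R)
    if T < 2 or any(len(row) != T for row in R):
        raise ValueError("R must be a square matrix with at least 2 tasks")
    final = R[-1]
    out = {
        # Average performance over all tasks after the last training stage.
        "ap": mean(final),
        # Drop from the best earlier score on each old task to its final score.
        "forgetting": mean(
            max(R[i][j] for i in range(j, T - 1)) - final[j] for j in range(T - 1)
        ),
        # Change on each old task relative to right after it was learned.
        "bwt": mean(final[j] - R[j][j] for j in range(T - 1)),
    }
    if baseline is not None:
        # Score on each task just before training on it, versus the baseline.
        out["fwt"] = mean(R[j - 1][j] - baseline[j] for j in range(1, T))
    return out

if __name__ == "__main__":
    R = [
        [0.80, 0.10, 0.05],
        [0.70, 0.85, 0.20],
        [0.60, 0.75, 0.90],
    ]
    print(continual_metrics(R, baseline=[0.05, 0.05, 0.05]))
```

Negative BWT alongside positive forgetting indicates that later tasks degraded earlier ones, which is the baseline signal to record before trying adapters or replay.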

Datasets

Common Crawl, MMLU, GSM8K, domain corpora (medical, legal, financial)

Benchmarks

TRACE, CITB, InstrDialog / InstrDialog++, InvariantLama / UpdatedLama / NewLama, StreamBench, long-sequence 15-task benchmark (Razdaibiedina et al.)

Context Entities

Models

Mix-CPT, Llama3-Physician, FinPythia-6.9B

Metrics

training data cost, training resource fraction

Datasets

reading-comprehension converted corpora, instruction-following seeds

Benchmarks

TRACE, CITB, StreamBench
