Overview
This comprehensive survey synthesizes many empirical studies. Its actionable patterns (PEFT, rehearsal, data selection) have repeated empirical support, but the specific gains depend on domain and setup.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 2/2
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Continual learning lets you update LLMs for new data or regulations with far less compute than retraining, reducing cost and speeding domain rollouts.
Summary TLDR
This survey maps continual learning (CL) methods for large language models across three stages: continual pre-training (updating base knowledge), continual fine-tuning (updating task skills), and continual alignment (updating values and preferences). It groups methods into rehearsal, regularization, architecture, data-augmentation, and process-optimization families. The paper reviews benchmarks and metrics (average performance, forgetting, forward and backward transfer) and notes concrete resource savings from some CL techniques, for example comparable performance with 10% of the continual pre-training data in one financial-domain study and 40% of the training resources in one medical-domain study. It also highlights open gaps: catastrophic forgetting, limited transfer, online and multimodal CL, and evaluation blind spots.
Problem Statement
LLMs are trained once on static corpora but the world and user needs change continuously. Re-training large models from scratch is too costly. We need methods that let LLMs acquire new knowledge and preferences over time without losing old capabilities.
Main Contribution
Organizes continual learning for LLMs into three practical stages: continual pre-training, continual fine-tuning, continual alignment.
Expands the canonical CL taxonomy (rehearsal, regularization, architecture) and subcategorizes methods by forgetting-mitigation mechanism.
Key Findings
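Some data-augmentation and selection strategies sharply cut continual pre-training cost: FinPythia is reported to reach comparable performance with about 10% of the vanilla continual pre-training data.
Domain continual pre-training can match or beat stronger models with fewer resources: Llama3-Physician is reported to outperform GPT-4 on several medical benchmarks with 40% of the training resources.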
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| data cost for continual pre-training | 10% of vanilla CPT data | vanilla continual pre-training | ≈10x less data | FinPythia / financial domain | FinPythia achieves comparable performance using efficient data selection | Xie et al. [36] |
| training resources for domain model | 40% of typical resources | comparable GPT-4 domain performance | ≈60% fewer training resources | medical domain (Llama3-Physician) | Llama3-Physician outperforms GPT-4 on several medical benchmarks with 40% of the training resources | Guo et al. [37] |
What To Try In 7 Days
Measure forgetting: compute average performance (AP) and forgetting rate on your task stream to get a baseline.
Apply LoRA or other PEFT adapters for a new task to avoid full-model retraining.
Try small-scale continual pre-training with curated domain data (for example, a focused selection of about 10% of the candidate corpus) to test cost versus gain.
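As a starting point for the first item, the sketch below computes average performance, forgetting, and backward transfer from a matrix of evaluation scores. It assumes the definitions commonly used in the CL literature, which may differ in detail from those in the survey; the function name `cl_metrics` and the layout `acc[i][j]` (score on task j after training through task i) are illustrative choices, not taken from the paper.

```python
from typing import Dict, List


def cl_metrics(acc: List[List[float]]) -> Dict[str, float]:
    """Summarize a task stream from a square score matrix.

    acc[i][j] is the score on task j measured after training on tasks 0..i.
    (Hypothetical layout chosen for this sketch.)
    """
    num_tasks = len(acc)
    if num_tasks < 2 or any(len(row) != num_tasks for row in acc):
        raise ValueError("expected a square matrix covering at least two tasks")

    final = acc[num_tasks - 1]

    # Average performance: mean score over all tasks after the last stage.
    average_performance = sum(final) / num_tasks

    # Forgetting: for each earlier task, the best score seen after it was
    # learned (and before the last stage) minus its final score.
    forgetting = sum(
        max(acc[i][j] for i in range(j, num_tasks - 1)) - final[j]
        for j in range(num_tasks - 1)
    ) / (num_tasks - 1)

    # Backward transfer: final score minus the score right after learning
    # the task; negative values indicate forgetting.
    backward_transfer = sum(
        final[j] - acc[j][j] for j in range(num_tasks - 1)
    ) / (num_tasks - 1)

    return {
        "average_performance": average_performance,
        "forgetting": forgetting,
        "backward_transfer": backward_transfer,
    }


if __name__ == "__main__":
    # Made-up scores for a three-task stream, for illustration only.
    scores = [
        [0.90, 0.10, 0.10],
        [0.80, 0.85, 0.20],
        [0.70, 0.75, 0.88],
    ]
    print(cl_metrics(scores))
```

Recording the full matrix, rather than only the final row, is what makes forgetting measurable: it requires re-evaluating every earlier task after each training stage.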
Risks & Boundaries
Limitations
Survey compiles existing studies but does not provide new unified benchmarks or code.
Some cited results (resource reductions, benchmark wins) come from individual papers and may not generalize.
When Not To Use
If you require provable, worst-case guarantees on forgetting; CL methods are largely empirical.
If you can afford full re-pretraining and need maximal global consistency.
Failure Modes
Catastrophic forgetting still occurs under some CPT and fine-tuning schedules.
Poorly curated replay or synthetic samples can induce distributional bias.
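One simple way to limit selection bias in a replay set is to draw it uniformly from everything seen so far. The sketch below uses reservoir sampling (Algorithm R) for this; it is a generic technique offered as an illustration, not a method attributed to the survey, and the class name `ReservoirReplayBuffer` and its parameters are hypothetical.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class ReservoirReplayBuffer(Generic[T]):
    """Fixed-size replay buffer holding a uniform sample of a stream."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0  # number of examples offered so far
        self._rng = random.Random(seed)

    def add(self, example: T) -> None:
        """Offer one example; each example seen ends up kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
            return
        # Algorithm R: pick a slot in [0, seen); replace only if it falls
        # inside the buffer, which happens with probability capacity / seen.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = example

    def sample(self, k: int) -> List[T]:
        """Draw up to k stored examples without replacement for a replay batch."""
        return self._rng.sample(self.items, min(k, len(self.items)))


if __name__ == "__main__":
    buffer: ReservoirReplayBuffer[int] = ReservoirReplayBuffer(capacity=5, seed=0)
    for example_id in range(100):
        buffer.add(example_id)
    print(len(buffer.items), buffer.seen)
```

Uniform sampling mirrors the distribution of the stream itself, so it does not correct for a stream that is already skewed; stratifying by task or domain is a possible alternative when that matters.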

