Overview
Production Readiness: 0.6
Novelty Score: 0.4
Cost Impact Score: 0.7
Citation Count: 0
Why It Matters For Business
Continual learning lets you update LLMs for new data or regulations with far less compute than retraining, reducing cost and speeding domain rollouts.
Summary TLDR
This survey maps continual learning (CL) methods for large language models across three stages: continual pre-training (updating base knowledge), continual fine-tuning (updating task skills), and continual alignment (updating values/preferences). It groups methods into rehearsal, regularization, architecture, data-augmentation and process-optimization families. The paper reviews benchmarks and metrics (average performance, forgetting, forward/backward transfer), notes concrete resource gains from some CL techniques (examples: 10% data cost, 40% training resources), and highlights open gaps: catastrophic forgetting, limited transfer, online/multimodal CL, and evaluation blind spots.
Problem Statement
LLMs are trained once on static corpora but the world and user needs change continuously. Re-training large models from scratch is too costly. We need methods that let LLMs acquire new knowledge and preferences over time without losing old capabilities.
Main Contribution
Organizes continual learning for LLMs into three practical stages: continual pre-training, continual fine-tuning, continual alignment.
Expands canonical CL taxonomy (rehearsal, regularization, architecture) and subcategorizes methods by forgetting-mitigation mechanism.
Summarizes evaluation metrics and benchmarks and pinpoints core challenges and research opportunities (online CL, multimodal CL, semi-parametric hybrids).
Key Findings
Some data-augmentation and selection strategies dramatically cut pre-training cost.
Domain continual pre-training can match or beat stronger models with less compute.
CL evaluation focuses on four metrics to capture learning and forgetting: average performance, forgetting rate, forward transfer, and backward transfer.
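These metrics are usually computed from a score matrix R, where R[i][j] is the score on task j after training through task i. The sketch below uses one common set of definitions from the continual-learning literature; the survey's exact formulas may differ, and `cl_metrics`, `R`, and `baseline` are illustrative names rather than anything defined in the paper.

```python
from statistics import mean


def cl_metrics(R, baseline):
    """Compute AP, forgetting, BWT and FWT from a square score matrix.

    R[i][j]     : score on task j after training on tasks 0..i (needs >= 2 tasks)
    baseline[j] : score on task j of a reference model that never saw any task
                  (an assumption needed only for forward transfer)
    """
    T = len(R)
    final = R[T - 1]

    # Average performance: mean score over all tasks after the last stage.
    ap = mean(final)

    # Forgetting: best score a task ever reached before the last stage,
    # minus its final score, averaged over all but the last task.
    forgetting = mean(
        [max(R[i][j] for i in range(j, T - 1)) - final[j] for j in range(T - 1)]
    )

    # Backward transfer: final score minus the score right after learning the task.
    bwt = mean([final[j] - R[j][j] for j in range(T - 1)])

    # Forward transfer: score on a task just before learning it, minus the baseline.
    fwt = mean([R[j - 1][j] - baseline[j] for j in range(1, T)])

    return {"AP": ap, "forgetting": forgetting, "BWT": bwt, "FWT": fwt}


# Made-up scores for a three-task stream, for illustration only.
R = [
    [0.80, 0.30, 0.20],
    [0.70, 0.85, 0.35],
    [0.65, 0.75, 0.90],
]
metrics = cl_metrics(R, baseline=[0.25, 0.25, 0.25])
print(metrics)
```

Under these definitions, negative BWT and positive forgetting both indicate that earlier tasks degraded as later ones were learned.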
Results
Data cost for continual pre-training: 10% (example cited in the survey)
Training resources for domain model: 40% (example cited in the survey)
Who Should Care
What To Try In 7 Days
Measure forgetting: compute Average Performance (AP) and Forgetting Rate on your task stream to get a baseline.
Apply LoRA or other PEFT adapters for a new task to avoid full-model retraining.
Try small-scale continual pre-training on curated domain data (for example, a focused 10% selection) to test cost against gain.
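A focused selection like the one in the last item could be prototyped as below. This is a deliberately simple stand-in that ranks documents by how many of their tokens appear in a hand-picked domain vocabulary; it is not the selection method of any paper in the survey, and `domain_score`, `select_top_fraction`, and the term list are hypothetical.

```python
def domain_score(text, domain_terms):
    """Fraction of whitespace-separated tokens that appear in the domain vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in domain_terms for token in tokens) / len(tokens)


def select_top_fraction(docs, domain_terms, fraction=0.10):
    """Keep the highest-scoring `fraction` of documents (at least one if any exist)."""
    keep = max(1, round(len(docs) * fraction))
    # sorted() is stable, so documents with equal scores keep their original order.
    ranked = sorted(docs, key=lambda doc: domain_score(doc, domain_terms), reverse=True)
    return ranked[:keep]


finance_terms = {"bond", "yield", "equity", "dividend"}
corpus = ["bond yield and equity dividend outlook"] + ["the cat sat on the mat"] * 9
subset = select_top_fraction(corpus, finance_terms, fraction=0.10)
print(subset)
```

In practice the scoring function would likely be replaced by something stronger (embedding similarity, perplexity under a domain model), with the top-fraction cut unchanged.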
Optimization Features
Token Efficiency
- instruction-synthesized data to improve signal per token
Infra Optimization
- mixing small curated corpora rather than full re-pretraining
Model Optimization
- MoE
- LoRA
System Optimization
- freeze-base + train small modules to cut compute
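To see why freezing the base and training small modules cuts compute, consider LoRA on a single weight matrix: instead of updating all d_out x d_in weights, it trains two low-rank factors with rank * (d_in + d_out) parameters in total. The arithmetic below is a back-of-the-envelope sketch; the 4096 x 4096 layer size and rank 8 are illustrative assumptions, not figures from the survey.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Return (LoRA params, full params, ratio) for one linear layer.

    LoRA freezes the d_out x d_in weight W and learns an update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)
    return lora_params, full_params, lora_params / full_params


# Assumed layer size and rank, for illustration only.
print(lora_trainable_fraction(d_in=4096, d_out=4096, rank=8))
# (65536, 16777216, 0.00390625)
```

For this assumed layer, the adapter trains about 0.4% of the weights that full fine-tuning would touch.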
Training Optimization
- data selection/augmentation (ETS-DACP, ETA-DACP)
- pre-instruction tuning (PIT) to reduce instability
- process optimization to decouple format alignment from knowledge
Inference Optimization
- adapter routing and memory composition (routing networks, switch models)
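Adapter routing can be illustrated with a much simpler mechanism than a learned routing network: pick the adapter whose key vector is closest to the input embedding, and fall back to the base model when nothing is close enough. The sketch below assumes per-adapter key vectors already exist; `route`, `adapter_keys`, and the 0.5 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route(embedding, adapter_keys, min_similarity=0.5):
    """Return the best-matching adapter name, or "base" when no key is similar enough."""
    best_name, best_sim = "base", -1.0
    for name, key in adapter_keys.items():
        sim = cosine(embedding, key)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= min_similarity else "base"


adapter_keys = {"legal": [1.0, 0.0, 0.0], "medical": [0.0, 1.0, 0.0]}
print(route([0.9, 0.1, 0.0], adapter_keys))  # close to the "legal" key
print(route([0.0, 0.0, 1.0], adapter_keys))  # close to neither key
```

The fallback threshold is one simple guard against sending an input to an adapter that was never meant to handle it.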
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Survey compiles existing studies but does not provide new unified benchmarks or code.
- Some cited results (resource reductions, benchmark wins) come from individual papers and may not generalize.
- Evaluation blind spots include multimodal CL, truly online streaming, and real-world privacy constraints.
When Not To Use
- If you require provable, worst-case guarantees on forgetting; CL methods are largely empirical.
- If you can afford full re-pretraining and need maximal global consistency.
Failure Modes
- Catastrophic forgetting still occurs under some CPT and fine-tuning schedules.
- Poorly curated replay or synthetic samples can induce distributional bias.
- Adapter routing or module-switching can mis-route and degrade performance on certain task types.
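On the replay failure mode: one common baseline for keeping a rehearsal buffer unbiased with respect to the stream is reservoir sampling, which gives every example seen so far the same chance of being retained. The class below is a minimal sketch of that idea, not a method taken from the survey; `ReservoirReplayBuffer` is a hypothetical name.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size replay buffer holding a uniform random sample of the stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not full yet: always keep the example.
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen,
            # evicting a uniformly chosen slot.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k):
        """Draw up to k distinct buffered examples to mix into a training batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


buffer = ReservoirReplayBuffer(capacity=5)
for example_id in range(100):
    buffer.add(example_id)
print(buffer.sample(3))
```

Uniform retention does not fix low-quality or synthetic samples, so curation still has to happen before examples reach the buffer.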
Core Entities
Models
- LLaMA
- LLaMA2
- LLaMA3
- GPT-4
- Claude
- Qwen
- Pythia
- GLaM
- RoBERTa
- BERT
- T5
- BLOOMZ
- Mistral-7B
Metrics
- Average Performance (AP)
- Forgetting Rate
- Forward Transfer Rate (FWT)
- Backward Transfer Rate (BWT)
- FUAR (Forgotten / (Updated + Acquired) Ratio)
Datasets
- Common Crawl
- MMLU
- GSM8K
- domain corpora (medical, legal, financial)
Benchmarks
- TRACE
- CITB
- InstrDialog / InstrDialog++
- InvariantLAMA / UpdatedLAMA / NewLAMA
- StreamBench
- long-sequence 15-task benchmark (Razdaibiedina et al.)
Context Entities
Models
- Mix-CPT
- Llama3-Physician
- FinPythia-6.9B
Metrics
- training data cost
- training resource fraction
Datasets
- reading-comprehension converted corpora
- instruction-following seeds
Benchmarks
- TRACE
- CITB
- StreamBench

