Overview
Production Readiness: 0.6
Novelty Score: 0.4
Cost Impact Score: 0.7
Citation Count: 23
Why It Matters For Business
Continual learning lets LLMs stay current with facts, tools and user values without full retraining, saving time and money while reducing model downtime.
Summary TLDR
This survey organizes continual learning for large language models (LLMs) into three practical stages: continual pre-training (updating facts, domains, languages), continual instruction tuning (teaching new tasks, domains, tools), and continual alignment (updating values and preferences). It summarizes methods (replay, regularization, dynamic architectures, and parameter-efficient tuning such as LoRA, prompts, and adapters), benchmarks (TemporalWiki, TRACE, CITB, SHP, HH), and evaluation metrics (forward transfer, FWT; backward transfer, BWT; average performance; and the cross-stage deltas GAD, IFD, and SD for general ability, instruction following, and safety). Key challenges are catastrophic and cross-stage forgetting, compute cost, the lack of alignment benchmarks, and the need for controllable forgetting and history tracking.
Problem Statement
LLMs are costly to retrain but must be updated for new facts, domains, tools, languages and shifting human values. Existing continual learning (CL) methods for smaller models do not transfer cleanly to LLMs. Major problems are catastrophic forgetting, cross-stage forgetting between pretraining/finetuning/alignment, high compute, and scarce standard benchmarks for continual alignment.
Main Contribution
Organizes continual learning for LLMs into three stages: continual pre-training, instruction tuning, and alignment.
Provides a taxonomy by stage and by the type of information updated (facts, domains, tasks, skills, values, preferences).
Surveys representative methods per stage: datasets, replay/regularization/architecture, and parameter-efficient tuning.
Summarizes benchmarks and evaluation metrics for continual learning and cross-stage forgetting.
Lists open challenges and practical future directions (compute-efficiency, controllable forgetting, automatic continual learning, history tracking).
Key Findings
Continual learning for LLMs is multi-stage: continual pre-training, continual instruction tuning, and continual alignment.
Catastrophic forgetting and cross-stage forgetting are common when updating LLMs.
Parameter-efficient approaches reduce compute and help retain past abilities (examples: Progressive Prompts, LoRA, adapters).
Benchmarks exist for each stage: TemporalWiki for continual pre-training (CPT), TRACE and CITB for continual instruction tuning (CIT), and SHP and HH for continual alignment (CA) experiments.
Evaluation needs both continual-learning metrics (FWT, BWT, average) and cross-stage metrics (GAD, IFD, SD).
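To make the continual-learning metrics concrete, here is a minimal sketch of how FWT, BWT, and average performance are commonly computed from a results matrix. It assumes the widely used formulation in which R[i][j] is the score on task j after training on tasks 0..i, and baseline[j] is the score of an untrained or reference model on task j. The survey may define these metrics slightly differently, and the function names and example scores are illustrative only.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def _check_square(R: Matrix) -> int:
    """Return the number of tasks T, requiring a T x T matrix with T >= 2."""
    T = len(R)
    if T < 2 or any(len(row) != T for row in R):
        raise ValueError("R must be a square matrix covering at least 2 tasks")
    return T


def average_performance(R: Matrix) -> float:
    """Mean score over all tasks after training on the final task."""
    T = _check_square(R)
    return sum(R[T - 1]) / T


def backward_transfer(R: Matrix) -> float:
    """Mean change on earlier tasks between 'just learned' and 'end of sequence'.

    Negative values indicate forgetting.
    """
    T = _check_square(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R: Matrix, baseline: Sequence[float]) -> float:
    """Mean gain on each task before it is trained, relative to a baseline."""
    T = _check_square(R)
    if len(baseline) != T:
        raise ValueError("baseline must have one score per task")
    return sum(R[i - 1][i] - baseline[i] for i in range(1, T)) / (T - 1)


# Illustrative (made-up) scores for a three-task sequence.
R = [
    [0.80, 0.30, 0.20],
    [0.70, 0.85, 0.35],
    [0.65, 0.80, 0.90],
]
baseline = [0.25, 0.25, 0.25]

avg = average_performance(R)          # about 0.783
bwt = backward_transfer(R)            # about -0.10 (some forgetting)
fwt = forward_transfer(R, baseline)   # about 0.075 (mild positive transfer)
```

Tracking the full matrix rather than only the final row is what makes BWT and FWT recoverable later, so it is worth logging every (stage, task) score during an update sequence.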
What To Try In 7 Days
Run a small CPT pass on a recent domain corpus (hours to days) and measure GAD/IFD/SD.
Prototype LoRA or adapter updates for one workflow and compare BWT against full fine-tuning.
Use CITB or a subset of SuperNI to simulate incremental instruction updates, and track FWT and BWT daily.
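For the cross-stage measurements, a rough sketch of a before/after report is shown below. It assumes that GAD, IFD, and SD are each the mean per-task score change relative to the model before the update, measured on general-ability, instruction-following, and safety evaluation sets respectively. The task names and scores are hypothetical placeholders, not results from the survey.

```python
from typing import Dict, Mapping

Scores = Mapping[str, float]


def mean_delta(before: Scores, after: Scores) -> float:
    """Mean per-task change in score; both mappings must cover the same tasks."""
    if not before or before.keys() != after.keys():
        raise ValueError("before/after must share the same non-empty task set")
    return sum(after[task] - before[task] for task in before) / len(before)


def cross_stage_report(
    before: Mapping[str, Scores], after: Mapping[str, Scores]
) -> Dict[str, float]:
    """Compute one delta per evaluation family (keys: 'GAD', 'IFD', 'SD')."""
    return {name: mean_delta(before[name], after[name]) for name in ("GAD", "IFD", "SD")}


# Hypothetical scores before and after a continual pre-training pass.
before = {
    "GAD": {"general_qa": 0.50, "math": 0.30},
    "IFD": {"instruction_set_a": 0.70},
    "SD": {"safety_set_a": 0.90},
}
after = {
    "GAD": {"general_qa": 0.46, "math": 0.24},
    "IFD": {"instruction_set_a": 0.65},
    "SD": {"safety_set_a": 0.80},
}

report = cross_stage_report(before, after)
# Negative deltas flag abilities that regressed during the update.
regressions = {name: delta for name, delta in report.items() if delta < 0}
```

Keeping the evaluation sets fixed across updates is the main design constraint here: the deltas are only comparable if the same tasks are scored before and after each stage.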
Optimization Features
Token Efficiency
- progressive prompts (learn tokens not weights)
Model Optimization
- LoRA
- adapters
- block expansion (Llama PRO)
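As an illustration of why LoRA is cheap to update, the sketch below implements the low-rank idea in plain Python: the original weight matrix stays frozen and only two small matrices, A and B, would be trained, with the effective weight being W + (alpha / rank) * B A. This is a toy, dependency-free sketch for intuition, not the API of any particular library; the class and method names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Toy linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen, d_out x d_in
        self.scale = alpha / rank
        # A starts with small random values, B starts at zero, so the
        # layer initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self) -> Matrix:
        delta = matmul(self.B, self.A)  # d_out x d_in, rank <= `rank`
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self) -> int:
        """Only A and B would receive gradients; the base weight would not."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# A 4 x 4 base layer has 16 weights; a rank-1 update trains only 8 values.
base = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(base, rank=1)
n_trainable = layer.trainable_params()
```

Because the base weights never change, the low-rank update for one domain or task can be stored and swapped separately, which is one reason such methods can help retain earlier abilities.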
System Optimization
- parameter-efficient tuning (PET) to cut compute
Training Optimization
- Progressive Prompts
- Dual Attention (DAPT)
- soft-masking for domain updates
- rehearsal/replay buffers
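The rehearsal/replay idea can be sketched as a fixed-size buffer of past examples that is mixed into each new training batch. The version below uses reservoir sampling so that every example seen so far has an equal chance of being retained; this sampling policy, the class name, and the `replay_fraction` parameter are assumptions chosen for illustration, and the methods covered by the survey may select replay data differently.

```python
import random
from typing import Any, List, Sequence


class ReplayBuffer:
    """Fixed-capacity store of past examples, filled by reservoir sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0                    # total examples offered so far
        self.rng = random.Random(seed)

    def add(self, example: Any) -> None:
        """Offer one example; it is kept with probability capacity / seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def mix(self, new_batch: Sequence[Any], replay_fraction: float = 0.25) -> List[Any]:
        """Return the new batch plus a sample of old examples for rehearsal."""
        k = min(len(self.items), int(len(new_batch) * replay_fraction))
        return list(new_batch) + self.rng.sample(self.items, k)


# Fill the buffer while training on an earlier task, then rehearse later.
buffer = ReplayBuffer(capacity=10)
for i in range(100):
    buffer.add(("old_task", i))

new_batch = [("new_task", i) for i in range(8)]
training_batch = buffer.mix(new_batch)   # 8 new examples plus 2 replayed ones
```

The replay fraction trades plasticity against retention: a larger share of old examples protects earlier tasks but slows learning on the new data.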
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- No new experimental results; this is a survey only.
- Limited theoretical analysis of multi-stage continual learning.
- Alignment benchmarks and standardized protocols remain scarce.
- Guidance on computational cost and system-level recipes stays high-level rather than prescriptive.
When Not To Use
- When on-the-fly retrieval (RAG) already meets update needs.
- For very small models where simple finetuning suffices.
- If you lack the resources to evaluate cross-stage forgetting.
Failure Modes
- Catastrophic forgetting of earlier tasks
- Cross-stage forgetting when switching between CPT/CIT/CA
- Degraded safety after instruction tuning
- Benchmark contamination or task leakage during evaluation
Core Entities
Models
- ChatGPT
- LLaMA
- FinPythia-6.9B
- Llama PRO
- Llemma
- EcomGPT-CT
Metrics
- FWT
- BWT
- Average Performance
- GAD
- IFD
- SD
Datasets
- TemporalWiki
- Firehose
- CKL
- TRACE
- CITB
- ConTinTin
- SuperNI
- SHP
- HH
Benchmarks
- TemporalWiki
- TRACE
- CITB
- ConTinTin

