Practical survey: how to keep LLMs up-to-date via continual pretraining, instruction tuning, and alignment

Overview

Decision Snapshot: Needs Validation

This is a literature survey synthesizing prior work. It is useful for planning continual updates but does not present new experimental proof.

Citations: 23

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, Gholamreza Haffari

Links

Abstract / PDF

Why It Matters For Business

Continual learning lets LLMs stay current with facts, tools and user values without full retraining, saving time and money while reducing model downtime.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This survey maps continual learning for large language models (LLMs) into three practical stages: continual pre-training (CPT), which updates facts, domains, and languages; continual instruction tuning (CIT), which teaches new tasks, domains, and tools; and continual alignment (CA), which updates values and preferences. It summarizes methods (replay, regularization, dynamic architectures, and parameter-efficient tuning such as LoRA, prompts, and adapters), benchmarks (TemporalWiki, TRACE, CITB, SHP, HH), and evaluation metrics: forward transfer (FWT), backward transfer (BWT), average performance, and the general ability, instruction following, and safety deltas (GAD/IFD/SD). Key challenges are catastrophic and cross-stage forgetting, compute cost, the lack of alignment benchmarks, and the need for controllable forgetting and history tracking.

Problem Statement

LLMs are costly to retrain but must be updated for new facts, domains, tools, languages and shifting human values. Existing continual learning (CL) methods for smaller models do not transfer cleanly to LLMs. Major problems are catastrophic forgetting, cross-stage forgetting between pretraining/finetuning/alignment, high compute, and scarce standard benchmarks for continual alignment.

Main Contribution

Organizes continual learning for LLMs into three stages: continual pre-training, instruction tuning, and alignment.

Provides a taxonomy by stage and by the type of information updated (facts, domains, tasks, skills, values, preferences).

Key Findings

Continual learning for LLMs is multi-stage: continual pretraining, instruction tuning, and alignment.

Practical Use: Design updates per stage: use CPT for facts/domains/languages, CIT for tasks/tools, CA for values/preferences; avoid mixing stages without safeguards.

Evidence Ref: Sections 1, 2.3, Figures 1-3

Catastrophic forgetting and cross-stage forgetting are common when updating LLMs.

Practical Use: Measure BWT and GAD/IFD/SD after each update; keep replay buffers or use parameter isolation to preserve past skills.

Evidence Ref: Sections 2.2, 2.3, 7.2 (cross-stage forgetting discussion)
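The transfer metrics named in this finding can be computed from a matrix of per-task scores. The sketch below assumes the definitions commonly used in the continual-learning literature (the Lopez-Paz and Ranzato formulation), which may differ in detail from the survey's own notation. `R[i][j]` is assumed to be the score on task `j` after training through task `i`, and `baseline[j]` is a hypothetical reference score for task `j` from a model that has not yet seen it. All numbers are illustrative, not results from the survey.

```python
from statistics import mean


def average_performance(R):
    """Mean score over all tasks after the final update."""
    return mean(R[-1])


def backward_transfer(R):
    """BWT: mean change on each earlier task between the moment it was
    learned and the end of training. Negative values indicate forgetting."""
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return mean(R[T - 1][j] - R[j][j] for j in range(T - 1))


def forward_transfer(R, baseline):
    """FWT: mean gain on each task just before training on it,
    relative to the (assumed) baseline score for that task."""
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return mean(R[j - 1][j] - baseline[j] for j in range(1, T))


# Illustrative scores only: row i = after training on task i, column j = task j.
R = [
    [0.80, 0.30, 0.20],
    [0.70, 0.85, 0.35],
    [0.60, 0.75, 0.90],
]
baseline = [0.25, 0.25, 0.25]

print("Average:", round(average_performance(R), 3))
print("BWT:", round(backward_transfer(R), 3))
print("FWT:", round(forward_transfer(R, baseline), 3))
```

Tracking BWT after every update makes forgetting visible early: a value that drifts further below zero suggests the replay or isolation safeguards need strengthening.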

What To Try In 7 Days

Run a small CPT pass on a recent domain corpus (hours to days) and measure GAD/IFD/SD.

Prototype LoRA or adapter updates for one workflow to compare BWT against full fine-tuning.

Use CITB or a subset of SuperNI to simulate incremental instruction updates and track FWT/BWT daily.
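For the first experiment in this list, GAD, IFD, and SD can each be treated as a mean score change between the original model and the updated one on a fixed set of benchmarks. The following is a minimal sketch under that assumption; the benchmark names and scores are hypothetical placeholders, and the exact benchmark sets should be taken from the TRACE setup rather than from this example.

```python
def ability_delta(before, after):
    """Mean per-benchmark change (after - before).
    Both arguments map benchmark name -> score and must share the same keys."""
    if before.keys() != after.keys():
        raise ValueError("before and after must cover the same benchmarks")
    if not before:
        raise ValueError("no benchmarks given")
    return sum(after[name] - before[name] for name in before) / len(before)


def delta_report(before, after):
    """Compute GAD, IFD, and SD from score dicts grouped by category.
    The category keys used here are an assumption of this sketch."""
    return {
        "GAD": ability_delta(before["general"], after["general"]),
        "IFD": ability_delta(before["instruction"], after["instruction"]),
        "SD": ability_delta(before["safety"], after["safety"]),
    }


# Hypothetical scores before and after a small CPT pass.
before = {
    "general": {"general_a": 0.50, "general_b": 0.30},
    "instruction": {"instr_a": 0.70},
    "safety": {"safety_a": 0.90},
}
after = {
    "general": {"general_a": 0.48, "general_b": 0.26},
    "instruction": {"instr_a": 0.65},
    "safety": {"safety_a": 0.88},
}

report = delta_report(before, after)
for name, value in report.items():
    print(f"{name}: {value:+.3f}")
```

Negative deltas indicate that the update cost the model some general ability, instruction following, or safety relative to its starting point.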

Optimization Features

Token Efficiency

progressive prompts (learn prompt tokens, not model weights)

Model Optimization

LoRA, adapters, block expansion (Llama PRO)
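To illustrate why LoRA reduces the number of trainable parameters, here is a minimal pure-Python sketch of a single low-rank-adapted linear layer. It follows the standard LoRA formulation (frozen weight `W` plus a scaled low-rank update `B @ A`, with `B` initialized to zero); the class name `LoRALinear` and its methods are hypothetical and do not correspond to any particular library's API.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical illustration: y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight, d_out x d_in
        self.scale = alpha / rank
        # A: rank x d_in, small random init; B: d_out x rank, zero init,
        # so the adapted layer starts out identical to the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Only A and B are trained; W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# A 4 x 6 base weight with a rank-2 adapter.
W = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(W, rank=2)
x = [1.0, 0.5, -0.5, 2.0, 0.0, 1.0]

full_params = len(W) * len(W[0])
print("full fine-tuning params:", full_params)
print("LoRA trainable params:", layer.trainable_params())
```

The saving grows with layer size: for a `d_out x d_in` weight, LoRA trains `rank * (d_in + d_out)` values instead of `d_in * d_out`.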

System Optimization

parameter-efficient tuning (PET) to cut compute

Training Optimization

Progressive Prompts, Dual Attention (DAPT), soft-masking for domain updates, rehearsal/replay buffers
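A rehearsal buffer can be sketched in a few lines. The example below uses reservoir sampling so that every example seen so far has an equal chance of remaining in a fixed-size buffer; this is one common choice, not necessarily the strategy used by the methods the survey covers. The `ReplayBuffer` class and the `replay_fraction` parameter are hypothetical names introduced for illustration.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer of past examples, maintained by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def mixed_batch(self, new_examples, replay_fraction=0.25):
        """Return the new examples plus replayed old ones.
        The number replayed is replay_fraction times the new batch size,
        capped by how many examples the buffer currently holds."""
        k = min(len(self.items), int(len(new_examples) * replay_fraction))
        return list(new_examples) + self.rng.sample(self.items, k)


buffer = ReplayBuffer(capacity=50)
for i in range(1000):
    buffer.add(f"old-{i}")

new_batch = [f"new-{i}" for i in range(8)]
batch = buffer.mixed_batch(new_batch, replay_fraction=0.25)
print(len(buffer.items), len(batch))
```

Mixing a small share of old examples into each update batch is the simplest rehearsal setup; the right fraction is an empirical question and should be tuned against BWT.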

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

No new experimental results; this is a survey only.

Limited theoretical analysis of multi-stage continual learning.

When Not To Use

When on-the-fly retrieval (RAG) already meets update needs.

For very small models where simple finetuning suffices.

Failure Modes

Catastrophic forgetting of earlier tasks

Cross-stage forgetting when switching between CPT/CIT/CA

Core Entities

Models

ChatGPT, LLaMA, FinPythia-6.9B, Llama PRO, Llemma, EcomGPT-CT

Metrics

FWT, BWT, Average Performance, GAD, IFD, SD

Datasets

TemporalWiki, Firehose, CKL, TRACE, CITB, ConTinTin, SuperNI, SHP, HH

Benchmarks

TemporalWiki, TRACE, CITB, ConTinTin

