Survey: how to update LLMs continuously without full retraining

March 13, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

0

Authors

Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin

Why It Matters For Business

Continual learning lets you update LLMs for new data or regulations with far less compute than retraining, reducing cost and speeding domain rollouts.

Summary TLDR

This survey maps continual learning (CL) methods for large language models across three stages: continual pre-training (updating base knowledge), continual fine-tuning (updating task skills), and continual alignment (updating values and preferences). It groups methods into five families: rehearsal, regularization, architecture, data augmentation, and process optimization. The paper reviews benchmarks and metrics (average performance, forgetting, forward and backward transfer), notes concrete resource gains reported for some CL techniques (for example, FinPythia at 10% of the data cost of vanilla continual pre-training, and Llama3-Physician at 40% of the training resources), and highlights open gaps: catastrophic forgetting, limited transfer, online and multimodal CL, and evaluation blind spots.

Problem Statement

LLMs are trained once on static corpora, but the world and user needs change continuously. Retraining large models from scratch is too costly. We need methods that let LLMs acquire new knowledge and preferences over time without losing old capabilities.

Main Contribution

Organizes continual learning for LLMs into three practical stages: continual pre-training, continual fine-tuning, continual alignment.

Expands the canonical CL taxonomy (rehearsal, regularization, architecture) and subcategorizes methods by forgetting-mitigation mechanism.

Summarizes evaluation metrics and benchmarks and pinpoints core challenges and research opportunities (online CL, multimodal CL, semi-parametric hybrids).

Key Findings

Some data-augmentation and selection strategies dramatically cut pretraining cost.

Numbers: FinPythia: 10% of data cost vs vanilla CPT

Domain continual pre-training can match or beat stronger models with less compute.

Numbers: Llama3-Physician: used 40% of the training resources and outperformed GPT-4 on several medical benchmarks

CL evaluation focuses on four metrics to capture learning and forgetting.

Numbers: AP, Forgetting Rate, Forward Transfer (FWT), Backward Transfer (BWT)
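
The four metrics can be computed from a single results matrix. The sketch below is a minimal illustration under stated assumptions: `R[i][j]` is the score on task `j` after training on tasks `0..i`, `baseline[j]` is the score on task `j` before any continual training, and there are at least two tasks. The formulas follow widely used CL conventions; the survey's exact definitions may differ, and all function names here are illustrative.

```python
def average_performance(R):
    """Mean score over all tasks after training on the final task."""
    final = R[-1]
    return sum(final) / len(final)


def forgetting(R):
    """Mean drop from each earlier task's best past score to its final score."""
    T = len(R)
    drops = []
    for j in range(T - 1):
        # Best score on task j at any point from when it was learned
        # up to (but not including) the final training step.
        best_past = max(R[i][j] for i in range(j, T - 1))
        drops.append(best_past - R[-1][j])
    return sum(drops) / len(drops)


def backward_transfer(R):
    """Mean change on earlier tasks between just-learned and final scores.
    Negative values indicate forgetting."""
    T = len(R)
    return sum(R[-1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forward_transfer(R, baseline):
    """Mean gain on each task just before it is trained, relative to the
    untrained baseline. Positive values indicate useful transfer."""
    T = len(R)
    return sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)


# Hypothetical three-task stream (scores are made up for illustration).
R = [
    [0.80, 0.30, 0.20],
    [0.70, 0.85, 0.35],
    [0.65, 0.80, 0.90],
]
baseline = [0.25, 0.25, 0.25]

print(average_performance(R), forgetting(R),
      backward_transfer(R), forward_transfer(R, baseline))
```

On this made-up matrix, AP is about 0.783, forgetting about 0.10, BWT about -0.10, and FWT about 0.075.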

Results

data cost for continual pre-training

Value: 10% of vanilla CPT data

Baseline: vanilla continual pre-training

training resources for domain model

Value: 40% of typical resources

Baseline: comparable GPT-4 domain performance

What To Try In 7 Days

Measure forgetting: run AP and Forgetting Rate on your task stream to get a baseline.

Apply LoRA or other PEFT adapters for a new task to avoid full-model retraining.

Try small-scale continual pre-training on curated domain data (for example, a focused selection of about 10% of the corpus) to test cost versus gain.
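
The data-selection step can be prototyped before any training run. The sketch below keeps the top-scoring fraction of a corpus under a deliberately simple, hypothetical relevance score (share of tokens found in a hand-picked domain vocabulary). It is not the selection method of any cited paper; `domain_score`, `select_top_fraction`, and the vocabulary are all assumptions made for illustration.

```python
def domain_score(text, domain_terms):
    """Hypothetical relevance score: share of tokens that are domain terms."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,;:()") in domain_terms)
    return hits / len(tokens)


def select_top_fraction(docs, domain_terms, fraction=0.10):
    """Keep the highest-scoring `fraction` of docs (rounded down, at least one)."""
    k = max(1, int(len(docs) * fraction))
    ranked = sorted(docs, key=lambda d: domain_score(d, domain_terms), reverse=True)
    return ranked[:k]


# Made-up mini corpus: one finance document among nine unrelated ones.
docs = ["the bond yield rose as credit spreads widened"] + [
    f"a recipe for soup number {i}" for i in range(9)
]
domain_terms = {"bond", "yield", "credit", "equity"}
selected = select_top_fraction(docs, domain_terms, fraction=0.10)
print(selected)
```

In practice the score would more plausibly come from embedding similarity or perplexity, but the keep-the-top-fraction structure stays the same.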

Optimization Features

Token Efficiency

  • instruction-synthesized data to improve signal per token

Infra Optimization

  • mixing small curated corpora rather than full re-pretraining
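
Mixing corpora can be as simple as combining new-domain documents with a small replay sample of general data. The sketch below is one possible way to do that, not a recipe from the survey: `build_mix`, the `replay_ratio` default of 0.2, and the tagging scheme are all assumptions chosen for illustration.

```python
import random


def build_mix(domain_docs, general_docs, replay_ratio=0.2, seed=0):
    """Return a shuffled list of (source, doc) pairs in which roughly
    `replay_ratio` of the items are replayed general-domain documents."""
    if not 0 <= replay_ratio < 1:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_domain + n_replay) = replay_ratio for n_replay.
    n_replay = round(len(domain_docs) * replay_ratio / (1.0 - replay_ratio))
    n_replay = min(n_replay, len(general_docs))
    replay = rng.sample(general_docs, n_replay)
    mix = [("domain", d) for d in domain_docs] + [("replay", d) for d in replay]
    rng.shuffle(mix)
    return mix


# Hypothetical corpora: 80 domain docs, 1000 general docs to sample from.
domain = [f"d{i}" for i in range(80)]
general = [f"g{i}" for i in range(1000)]
mix = build_mix(domain, general, replay_ratio=0.2)
print(len(mix), sum(1 for tag, _ in mix if tag == "replay"))
```

The right replay ratio is an empirical question; tracking forgetting on held-out general tasks is one way to tune it.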

Model Optimization

  • MoE
  • LoRA

System Optimization

  • freeze-base + train small modules to cut compute
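
To make the freeze-base idea concrete, here is a dependency-free sketch of a LoRA-style layer: the base weight `W` stays fixed and only two small low-rank matrices `A` and `B` would be trained. It assumes the usual LoRA form `y = Wx + (alpha / r) * B(Ax)` with `B` initialized to zero; the class and method names are illustrative and do not correspond to any library's API.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight, never updated
        # Trainable low-rank factors: A is (rank x d_in), B is (d_out x rank).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Tiny made-up example: a 4x6 base layer with a rank-2 adapter.
W = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(W, rank=2)
x = [1.0, 0.5, -0.5, 2.0, 0.0, 1.0]
print(layer.trainable_params(), len(W) * len(W[0]))
```

At realistic sizes the saving is much larger: a rank-8 adapter on a 4096 x 4096 layer has 65,536 trainable parameters against 16,777,216 frozen ones, about 0.39%.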

Training Optimization

  • data selection/augmentation (ETS-DACP, ETA-DACP)
  • pre-instruction tuning (PIT) to reduce instability
  • process optimization to decouple format alignment from knowledge

Inference Optimization

  • adapter routing and memory composition (routing networks, switch models)
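
Adapter routing can be illustrated with a much simpler stand-in than a learned routing network: compare a query embedding with one key vector per adapter and pick the closest, falling back to the base model when nothing is similar enough. Everything in this sketch (the `route` function, the key vectors, the `min_score` threshold) is hypothetical and only meant to show the shape of the decision.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def route(query_vec, adapter_keys, min_score=0.5):
    """Return the name of the best-matching adapter, or None to use the
    base model when no adapter key is similar enough to the query."""
    best_name, best_score = None, -1.0
    for name, key in adapter_keys.items():
        score = cosine(query_vec, key)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


# Made-up 3-dimensional keys for two domain adapters.
adapter_keys = {"medical": [1.0, 0.0, 0.0], "legal": [0.0, 1.0, 0.0]}
print(route([0.9, 0.1, 0.0], adapter_keys))
print(route([0.0, 0.0, 1.0], adapter_keys))
```

The fallback threshold is one simple guard against the mis-routing failure mode: a query that matches no adapter well is served by the base model instead of a poorly suited module.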

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Survey compiles existing studies but does not provide new unified benchmarks or code.
  • Some cited results (resource reductions, benchmark wins) come from individual papers and may not generalize.
  • Evaluation blind spots include multimodal CL, truly online streaming, and real-world privacy constraints.

When Not To Use

  • If you require provable, worst-case guarantees on forgetting — CL methods are largely empirical.
  • If you can afford full re-pretraining and need maximal global consistency.

Failure Modes

  • Catastrophic forgetting still occurs under some CPT and fine-tuning schedules.
  • Poorly curated replay or synthetic samples can induce distributional bias.
  • Adapter routing or module-switching can mis-route and degrade performance on certain task types.

Core Entities

Models

  • LLaMA
  • LLaMA2
  • LLaMA3
  • GPT-4
  • Claude
  • Qwen
  • Pythia
  • GLaM
  • RoBERTa
  • BERT
  • T5
  • BLOOMZ
  • Mistral-7B

Metrics

  • Average Performance (AP)
  • Forgetting Rate
  • Forward Transfer Rate (FWT)
  • Backward Transfer Rate (BWT)
  • FUAR (Forgotten / (Updated + Acquired) Ratio)

Datasets

  • Common Crawl
  • MMLU
  • GSM8K
  • domain corpora (medical, legal, financial)

Benchmarks

  • TRACE
  • CITB
  • InstrDialog / InstrDialog++
  • InvariantLAMA / UpdatedLAMA / NewLAMA
  • StreamBench
  • long-sequence 15-task benchmark (Razdaibiedina et al.)

Context Entities

Models

  • Mix-CPT
  • Llama3-Physician
  • FinPythia-6.9B

Metrics

  • training data cost
  • training resource fraction

Datasets

  • reading-comprehension converted corpora
  • instruction-following seeds

Benchmarks

  • TRACE
  • CITB
  • StreamBench