A fast finetuning recipe that makes an LLM 'forget' Harry Potter while keeping general skills

October 3, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

12

Authors

Ronen Eldan, Mark Russinovich

Why It Matters For Business

You can remove copyrighted or sensitive text from an LLM with a short, targeted fine-tune instead of full retraining. The authors contrast roughly 1 GPU-hour for the targeted edit with about 184K GPU-hours for pretraining.

Summary TLDR

The authors present an approximate unlearning method that combines (1) a short fine-tune that further trains ("reinforces") the model on the target text, (2) automatic replacement of unique terms with generic anchors, and (3) fine-tuning on the model's own generic predictions. Applied to Llama2-7b to remove Harry Potter content, the method cuts a measured "familiarity" score from ~0.29 to ~0.007 while leaving standard benchmark accuracy nearly unchanged. The approach needs only minutes to hours of GPU fine-tuning and can remove strongly idiosyncratic content, but its prompt-based evaluation may miss adversarial probes, it relies on an external LLM (GPT-4) for anchor extraction, and it may struggle on non-fiction or abstract concepts.
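
The anchored-term replacement in step (2) can be illustrated with a small sketch. The dictionary entries, the `ANCHORS` name, and the `genericize` helper are hypothetical and chosen only for illustration; they are not the authors' actual mapping or code.

```python
import re

# Hypothetical anchor dictionary: idiosyncratic term -> generic stand-in.
# These entries are illustrative, not the authors' actual mapping.
ANCHORS = {
    "Hogwarts": "Mystic Academy",
    "Harry": "Jon",
    "Hermione": "Jill",
    "Quidditch": "Skyball",
}


def genericize(text: str, anchors: dict) -> str:
    """Replace each anchored term with its generic counterpart (whole words only)."""
    if not anchors:
        return text
    # Try longer terms first so multi-word anchors win over shorter overlaps.
    terms = sorted(anchors, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: anchors[m.group(1)], text)


sample = "Harry walked into Hogwarts with Hermione."
print(genericize(sample, ANCHORS))  # Jon walked into Mystic Academy with Jill.
```

Whole-word matching is a deliberate choice in this sketch: it avoids rewriting unrelated words that merely contain an anchored term as a substring.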

Problem Statement

Can we make a pretrained LLM selectively forget a specific corpus (copyrighted books) without retraining from scratch, using computation proportional to the size of the removed data?

Main Contribution

A practical three-part pipeline for approximate unlearning: (A) reinforce the model by further fine-tuning it on the target text, then compare its logits with the baseline's to detect target-specific tokens; (B) replace idiosyncratic terms with generic anchors and collect the baseline model's predictions on the rewritten text; (C) fine-tune the baseline on those generic predictions to erase the specific links.

A proof-of-concept unlearning of the Harry Potter books from Llama2-7b that largely removes book-specific generations while keeping common benchmark performance.

A public release of the fine-tuned model and an ablation study showing both reinforcement and anchored-term steps are needed for best trade-offs.
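
One way to picture how the reinforced model helps produce generic labels is the sketch below. It assumes the generic logits are formed by subtracting from the baseline any increase the reinforced model shows, scaled by a coefficient `alpha`; this summary does not spell out the rule, so the combination, the `alpha` parameter, and the toy three-token vocabulary are assumptions of the example.

```python
def generic_logits(baseline, reinforced, alpha=1.0):
    """Penalize tokens whose logit rose after reinforcing on the target text.

    Assumed rule: generic = baseline - alpha * max(reinforced - baseline, 0).
    """
    return [b - alpha * max(r - b, 0.0) for b, r in zip(baseline, reinforced)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary and logits for a prompt such as "He enrolled at ...".
vocab = ["Hogwarts", "school", "castle"]
baseline = [3.0, 2.5, 1.0]    # the baseline model prefers the target-specific token
reinforced = [6.0, 2.4, 1.1]  # reinforcement boosts it much further

generic = generic_logits(baseline, reinforced, alpha=1.0)
print(vocab[argmax(baseline)])  # Hogwarts
print(vocab[argmax(generic)])   # school
```

Tokens whose logits grow under reinforcement are treated as target-specific and pushed down, so the label used for fine-tuning shifts toward a generic continuation.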

Key Findings

The method dramatically reduces model 'familiarity' with Harry Potter as measured by completion-based tests.

Numbers: Familiarity (completion): 0.29 → 0.007 after ~120 finetuning steps

General benchmark scores stay nearly the same after unlearning.

Numbers: ARC-challenge 0.44 → 0.414; BoolQ 0.807 → 0.796; WinoGrande 0.663 → 0.657

Both ingredients (reinforcement bootstrapping + anchored-term genericization) are required for the best result and trade-off.

Numbers: Reinforcement alone reduced familiarity by ≤×0.3; anchored-only matched familiarity but reduced benchmarks (ARC-chal 0.4

Results

Familiarity (completion-based): 0.29 → 0.007 (baseline 0.29)

Familiarity (probability-based): 0.244 → 0.006 (baseline 0.244)

Accuracy (ARC-challenge): 0.44 → 0.414 (baseline 0.44)

Accuracy (BoolQ): 0.807 → 0.796 (baseline 0.807)

Accuracy (WinoGrande): 0.663 → 0.657 (baseline 0.663)
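
To put the before/after numbers on a common scale, the short calculation below computes the relative change for each reported metric. The figures are the ones listed in this summary; the `RESULTS` dictionary and the `relative_change` helper are names introduced only for this example.

```python
# (baseline, after unlearning) pairs as reported in this summary.
RESULTS = {
    "Familiarity (completion-based)": (0.29, 0.007),
    "Familiarity (probability-based)": (0.244, 0.006),
    "ARC-challenge accuracy": (0.44, 0.414),
    "BoolQ accuracy": (0.807, 0.796),
    "WinoGrande accuracy": (0.663, 0.657),
}


def relative_change(before: float, after: float) -> float:
    """Signed change relative to the baseline value."""
    return (after - before) / before


changes = {name: relative_change(b, a) for name, (b, a) in RESULTS.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

On these figures, both familiarity scores fall by roughly 97%, while the three benchmark accuracies fall by about 1% to 6% relative to baseline.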

Who Should Care

What To Try In 7 Days

Run a small-scale test: pick a short target corpus (≤3M tokens), generate anchored-term replacements, and run the two-stage finetune (reinforce then generic-label finetune) for a f

Build an anchor dictionary automatically (LLM-assisted) and also test simple n-gram frequency fallback to avoid external model dependence.

Open the model to adversarial probing (community or internal red-team) to surface leaks missed by canned prompts.
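
The n-gram frequency fallback mentioned above could look roughly like the sketch below: it proposes anchor candidates as n-grams that recur in the target corpus but are absent (or rare) in a reference corpus. The function names, thresholds, and toy texts are assumptions for illustration, and a real run would need a much larger reference corpus and manual review of the candidates.

```python
import re
from collections import Counter


def ngrams(text: str, n: int = 1):
    """Lowercased word n-grams of the text, joined with spaces."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_anchors(target_text, reference_text, n=1,
                      min_count=2, max_reference_count=0):
    """N-grams seen at least min_count times in the target corpus and at most
    max_reference_count times in the reference corpus."""
    target = Counter(ngrams(target_text, n))
    reference = Counter(ngrams(reference_text, n))
    candidates = [
        (count, term)
        for term, count in target.items()
        if count >= min_count and reference[term] <= max_reference_count
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    return [term for count, term in sorted(candidates, key=lambda c: (-c[0], c[1]))]


target = ("Hogwarts is a school. Students at Hogwarts play Quidditch. "
          "Quidditch is a sport.")
reference = "A school is a place where students play a sport."
print(candidate_anchors(target, reference))  # ['hogwarts', 'quidditch']
```

Common words such as "is" and "a" are filtered out because they also appear in the reference text, which is the only signal this fallback uses.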

Optimization Features

Infra Optimization

  • runs in minutes–hours on GPUs; authors contrast ~1 GPU-hour vs pretraining ~184K GPU-hours

Training Optimization

  • short targeted finetuning (2 epochs on generic labels)
  • reinforced fine-tune on unlearn target (3 epochs) to detect target logits

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on distinctive names and idiosyncratic phrases; may struggle on non-fiction or abstract concepts.
  • Anchor extraction used GPT-4, introducing dependency on another large model and its knowledge.
  • Evaluation is prompt-based and may miss adversarial extraction methods or subtle residual knowledge.
  • Method can unintentionally remove related external knowledge (e.g., Wikipedia-level info) unless re-finetuned.

When Not To Use

  • When legal proof of data deletion is required (method is approximate, not provable removal).
  • To remove abstract ideas or conceptual knowledge not tied to unique tokens.
  • At scale for many diverse targets without careful testing, because of the risk of collateral forgetting.

Failure Modes

  • Residual 'wiki-level' knowledge leaks (e.g., school names) remain after finetuning.
  • Inconsistent replacements due to tokenization and mapping mismatch causing odd generations.
  • Overcorrection causing loss of general performance if anchors or hyperparameters are mischosen.
  • Unlearning a superset of related content (removes more than the intended target) unless re-tuned.

Core Entities

Models

  • Llama2-7b (Llama-2-7b-chat-hf)
  • SFT

Metrics

  • Familiarity (completion-based)
  • Familiarity (probability-based)
  • Accuracy
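
As a rough stand-in for a completion-based familiarity check, the sketch below scores a batch of model completions by the fraction that mention any target-specific term. This keyword proxy is an assumption made for illustration, not the authors' scoring procedure, and the `familiarity_proxy` helper and sample completions are hypothetical.

```python
def familiarity_proxy(completions, target_terms):
    """Fraction of completions that mention at least one target-specific term.

    A simplified keyword proxy; it is not the evaluation used by the authors.
    """
    if not completions:
        return 0.0
    terms = [t.lower() for t in target_terms]
    hits = sum(any(t in c.lower() for t in terms) for c in completions)
    return hits / len(completions)


# Hypothetical completions sampled from a model for target-related prompts.
completions = [
    "He studied at Hogwarts.",
    "He studied at a local school.",
    "She plays tennis.",
    "They love Quidditch.",
]
print(familiarity_proxy(completions, ["Hogwarts", "Quidditch"]))  # 0.5
```

A substring check like this misses paraphrased or indirect leaks, so it is best treated as a cheap first filter before manual or adversarial probing.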

Datasets

  • Harry Potter books (unlearn target, ~2.1M tokens)
  • Synthetic discussions/blog/wiki about books (~1M tokens)
  • Combined unlearn dataset (~3.1M tokens)

Benchmarks

  • ARC (challenge/easy)
  • BoolQ
  • HellaSwag
  • OpenBookQA
  • PIQA
  • WinoGrande