A fast finetuning recipe that makes an LLM 'forget' Harry Potter while keeping general skills

October 3, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The technique is a pragmatic proof-of-concept with clear numeric drops in measured familiarity; it works well on idiosyncratic fiction but needs adversarial stress tests and careful anchor construction before production use.

Citations: 12

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 70%

Authors

Ronen Eldan, Mark Russinovich

Links

Abstract / PDF / Code

Why It Matters For Business

You can remove copyrighted or sensitive text from an LLM with a short, targeted finetune instead of full retraining. The authors contrast roughly 184K GPU-hours of pretraining with about 1 GPU-hour for the targeted edit.

Who Should Care

Summary TLDR

The authors present an approximate unlearning method that combines (1) a small reinforcement-style fine-tune on the target text, (2) automatic replacement of unique terms with generic anchors, and (3) fine-tuning on the model's own generic predictions. Applied to Llama2-7b to remove Harry Potter content, the method cuts a measured "familiarity" score from ~0.29 to ~0.007 while leaving standard benchmark accuracy nearly unchanged. The approach runs in minutes to hours of GPU finetuning and can remove strongly idiosyncratic content, but it may not hold up under adversarial probing, depends on an external LLM for anchor extraction, and may struggle on non-fiction or abstract concepts.

Problem Statement

Can we make a pretrained LLM selectively forget a specific corpus (copyrighted books) without retraining from scratch, using computation proportional to the size of the removed data?

Main Contribution

A practical three-part pipeline for approximate unlearning: (A) reinforce the model on the target to detect target-specific logits, (B) replace idiosyncratic terms with generic anchors and collect the baseline model's predictions, (C) fine-tune the baseline on those generic predictions to erase the specific links.

A proof-of-concept unlearning of the Harry Potter books from Llama2-7b that largely removes book-specific generations while keeping common benchmark performance.
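Steps (A) and (B) can be pictured on a toy vocabulary. The sketch below is a minimal illustration, not the authors' code: it assumes the generic target is formed by subtracting a scaled positive part of the logit shift caused by reinforcement, and the function name `generic_logits`, the scale `alpha`, and the three-token vocabulary are all hypothetical.

```python
from typing import List


def generic_logits(baseline: List[float], reinforced: List[float],
                   alpha: float = 1.0) -> List[float]:
    """Penalize tokens whose logit rose after reinforcing on the target text.

    Assumed rule (illustrative): generic = baseline - alpha * max(reinforced - baseline, 0).
    Tokens that did not gain from reinforcement keep their baseline logit.
    """
    if len(baseline) != len(reinforced):
        raise ValueError("logit vectors must have the same length")
    return [b - alpha * max(r - b, 0.0) for b, r in zip(baseline, reinforced)]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical next-token logits over a toy vocabulary.
vocab = ["Ron", "Jack", "the"]
baseline = [2.0, 1.5, 0.5]    # baseline model prefers the book-specific name
reinforced = [4.0, 1.5, 0.4]  # reinforcement boosts only the book-specific name

generic = generic_logits(baseline, reinforced, alpha=1.0)
generic_label = vocab[argmax(generic)]  # token used as the step (C) training label
```

In this toy case the book-specific token is the only one whose logit grew under reinforcement, so it is the only one pushed down, and the generic label becomes the next-best ordinary token.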

Key Findings

The method dramatically reduces model 'familiarity' with Harry Potter as measured by completion-based tests.

Numbers: Familiarity (completion): 0.29 → 0.007 after ~120 finetuning steps

Practical Use: You can make the model stop producing explicit Harry Potter content on the authors' evaluation prompts using a short finetune run instead of full retraining.

Evidence Ref: Figure 5; Section 4

General benchmark scores stay nearly the same after unlearning.

Numbers: ARC-challenge 0.44 → 0.414; BoolQ 0.807 → 0.796; WinoGrande 0.663 → 0.657

Practical Use: The model retains most language and reasoning abilities on standard tasks, so targeted forgetting does not heavily degrade general performance on these benchmarks.

Evidence Ref: Figure 2 and Figure 5 (benchmark rows)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Familiarity (completion-based) | 0.29 → 0.007 | 0.29 | −0.283 | 300 curated prompts (completion eval) | Figure 5 completion row | Figure 5
Familiarity (probability-based) | 0.244 → 0.006 | 0.244 | −0.238 | 30 next-token prompts (probability eval) | Figure 5 probability row | Figure 5; Appendix 6.2.2
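The deltas in the results above follow directly from the before/after scores. The short check below recomputes them and adds the relative change. The helper name `familiarity_change` is hypothetical, and the relative figures are derived here rather than quoted from the paper.

```python
from typing import Dict, Tuple


def familiarity_change(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute delta, relative change in percent), rounded for display."""
    delta = after - before
    relative_pct = 100.0 * delta / before
    return round(delta, 3), round(relative_pct, 1)


# (baseline score, score after unlearning) as reported in the results above.
scores: Dict[str, Tuple[float, float]] = {
    "completion": (0.29, 0.007),
    "probability": (0.244, 0.006),
}

changes = {name: familiarity_change(b, a) for name, (b, a) in scores.items()}
for name, (delta, relative_pct) in changes.items():
    print(f"{name}: delta={delta:+.3f}, relative={relative_pct:+.1f}%")
```

Both metrics fall by roughly 97 to 98 percent relative to baseline, which is why the absolute deltas nearly equal the baseline values.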

What To Try In 7 Days

Run a small-scale test: pick a short target corpus (≤3M tokens), generate anchored-term replacements, and run the two-stage finetune (reinforce then generic-label finetune) for a f

Build an anchor dictionary automatically (LLM-assisted) and also test simple n-gram frequency fallback to avoid external model dependence.

Open the model to adversarial probing (community or internal red-team) to surface leaks missed by canned prompts.
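For the anchor-dictionary step, a minimal sketch follows. It assumes a plain term-to-generic mapping and uses a crude 1-gram frequency heuristic as the fallback for finding candidate terms without an external model. The names `replace_anchors` and `candidate_terms`, the example mapping, and the capitalized-word heuristic are all hypothetical, not the authors' implementation.

```python
import re
from collections import Counter
from typing import Dict, List


def replace_anchors(text: str, anchors: Dict[str, str]) -> str:
    """Swap idiosyncratic terms for generic counterparts on whole-word matches."""
    if not anchors:
        return text
    # Longest terms first so "Harry Potter" is matched before "Harry".
    keys = sorted(anchors, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: anchors[m.group(1)], text)


def candidate_terms(target: str, reference: str, min_count: int = 2) -> List[str]:
    """Fallback heuristic: capitalized words that recur in the target text
    but never appear in a generic reference text."""
    word = re.compile(r"\b[A-Z][a-z]+\b")
    target_counts = Counter(word.findall(target))
    reference_words = set(word.findall(reference))
    return sorted(w for w, c in target_counts.items()
                  if c >= min_count and w not in reference_words)


# Hypothetical anchor dictionary and texts, for illustration only.
anchors = {"Harry Potter": "Jon Smith", "Harry": "Jon", "Hogwarts": "Mystic Academy"}
rewritten = replace_anchors("Harry Potter went back to Hogwarts. Harry smiled.", anchors)

found = candidate_terms(
    "Harry went to Hogwarts. Harry liked Hogwarts. The castle was old.",
    "The school was old. She liked the castle.",
)
```

A frequency heuristic like this will miss multi-word terms and lower-case jargon, so it is best treated as a cross-check on an LLM-built dictionary rather than a replacement for it.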

Optimization Features

Infra Optimization
runs in minutes–hours on GPUs; authors contrast ~1 GPU-hour vs pretraining ~184K GPU-hours
Training Optimization
short targeted finetuning (2 epochs on generic labels)
reinforced fine-tune on unlearn target (3 epochs) to detect target logits

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on distinctive names and idiosyncratic phrases; may struggle on non-fiction or abstract concepts.

Anchor extraction used GPT-4, introducing dependency on another large model and its knowledge.

When Not To Use

When legal proof of data deletion is required (method is approximate, not provable removal).

To remove abstract ideas or conceptual knowledge not tied to unique tokens.

Failure Modes

Residual 'wiki-level' knowledge leaks (e.g., school names) remain after finetuning.

Inconsistent replacements due to tokenization and mapping mismatch causing odd generations.

Core Entities

Models

Llama2-7b (Llama-7b-chat-hf)
SFT

Metrics

Familiarity (completion-based)
Familiarity (probability-based)
Accuracy

Datasets

Harry Potter books (unlearn target, ~2.1M tokens)
Synthetic discussions/blog/wiki about books (~1M tokens)
Combined unlearn dataset (~3.1M tokens)

Benchmarks

ARC (challenge/easy)
BoolQ
HellaSwag
OpenBookQA
PIQA
WinoGrande