RWKU: a stress test for forgetting real-world facts in LLMs using 200 real-person targets and adversarial probes

June 16, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

1

Authors

Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, Jun Zhao

Why It Matters For Business

If you need to remove personal, copyrighted, or risky facts from an LLM, RWKU shows that current methods can fail under adversarial checks and batch deletions. Auditing with adversarial probes and membership-inference attacks (MIAs) is necessary for compliance and risk management.

Summary TLDR

RWKU is a new benchmark that tests whether LLMs can ‘forget’ real-world factual knowledge. It provides 200 real-person targets, 13,131 forget probes (fill-in-the-blank, QA, adversarial), 11,379 neighbor probes, and four membership-inference attacks. The benchmark uses a practical zero-shot unlearning setup (no original forget/retain corpora provided) and shows current unlearning methods (in-context unlearning, gradient ascent, preference-based losses, rejection tuning, representation control) struggle to both erase facts and preserve nearby knowledge and overall model utility. Batch unlearning (many targets) is especially fragile and can cause model collapse. The dataset and code are public.

Problem Statement

LLMs memorize real-world facts that may need to be removed for privacy, copyright, or safety. Existing unlearning evaluations are limited: they either use synthetic data or require access to the original training subsets. RWKU defines a practical zero-shot unlearning setting (only the target and the model are available) and builds a large, adversarial benchmark to measure whether unlearning methods actually erase targeted facts while keeping nearby knowledge and general abilities intact.

Main Contribution

A practical zero-shot unlearning benchmark (RWKU) with 200 real-world person targets and no access to original forget/retain corpora.

13,131 forget probes: 3,268 fill-in-the-blank (FB), 2,879 question-answer (QA), and 6,984 adversarial-attack (AA). Also 11,379 neighbor probes and a 6,198/7,487 MIA set for privacy testing.

An evaluation framework combining memorization tests (ROUGE-L), membership-inference attacks (four MIA metrics), nine adversarial jailbreak types, and utility tests (MMLU, BBH, TruthfulQA, TriviaQA, fluency).

Extensive experiments on LLaMA3-Instruct (8B) and Phi-3 Mini (3.8B) across six baseline unlearning methods and multiple fine-tuning styles (full, partial, LoRA).
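The ROUGE-L recall score used in the memorization tests above can be illustrated with a minimal sketch. It assumes lowercase whitespace tokenization, which may differ from the tokenization in the RWKU code; the function names and the example strings are hypothetical and are not taken from the benchmark.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, prediction):
    """Fraction of reference tokens recovered, in order, by the prediction.

    Assumption: lowercase whitespace tokenization (illustrative only).
    """
    ref = reference.lower().split()
    pred = prediction.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, pred) / len(ref)


# Hypothetical probe answer and model outputs.
leaked = rouge_l_recall("blue whale", "The answer is the blue whale")
refused = rouge_l_recall("blue whale", "I cannot help with that")
```

On the forget set, a lower score after unlearning means less of the reference answer is reproduced; on the neighbor set, the score should stay close to its pre-unlearning value.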

Key Findings

Adversarial and fill-in-the-blank (cloze) probes elicit supposedly forgotten facts more easily than standard QA probes.

Numbers: LLaMA3 forget-set ROUGE-L over all probe types: 79.6 before unlearning → 12.8 with ICU; FB and AA probes remain effective at eliciting the target facts

Zero-shot synthetic forget data (model-generated) often yields stronger unlearning than Wikipedia-based pseudo ground-truth corpora.

Numbers: Methods trained on the model-generated forget corpus (C_s_f) outperform those trained on the pseudo ground-truth corpus (C*_f) on forget metrics

Most evaluated unlearning methods fail or show weak guarantees under membership-inference attacks.

Numbers: The authors report that almost all methods trained on C_s_f fail the MIA checks (MIA leakage remains high)

Batch-target unlearning is far harder and can destabilize models at modest target counts.

Numbers: Model collapse starts at around 30 simultaneous targets in the experiments

Trade-offs exist: erasing a target often harms neighboring facts and some utilities (truthfulness, fluency).

Numbers: Neighbor-set ‘All’ scores drop and utility metrics (e.g., TruthfulQA, fluency) degrade for several methods (see Table 1)

Partial-layer fine-tuning (early layers) can yield stronger forgetting with less neighbor damage.

Numbers: Partial-layer experiments show stronger unlearning when updating early layers (Fig. 7 commentary)

Simple methods perform relatively well: in-context unlearning (ICU, instruction-based) works best on LLaMA3; gradient ascent (GA) and negative preference optimization (NPO) are competitive among methods that update parameters.

Numbers: ICU reduces LLaMA3 'All' forget ROUGE-L to 12.8 (Table 1); GA and NPO are among the best parameter-updating methods
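To make the two parameter-updating objectives named above more concrete, here is a rough sketch of gradient ascent (GA) and negative preference optimization (NPO) as loss functions over sequence log-probabilities. It follows the commonly cited formulations of these losses and is an assumption about their general shape, not a reproduction of the RWKU training code. The function names, the `beta` default, and the toy inputs are hypothetical.

```python
import math


def gradient_ascent_loss(forget_logprobs):
    """GA: minimizing the mean log-likelihood of forget sequences,
    which is the same as maximizing their negative log-likelihood."""
    return sum(forget_logprobs) / len(forget_logprobs)


def npo_loss(policy_logprobs, reference_logprobs, beta=0.1):
    """NPO: (2 / beta) * mean(log(1 + (pi_theta / pi_ref) ** beta)).

    policy_logprobs: log pi_theta(y | x) under the model being unlearned.
    reference_logprobs: log pi_ref(y | x) under the frozen original model.
    beta: assumed illustrative value, not a setting from the paper.
    """
    terms = []
    for lp, lr in zip(policy_logprobs, reference_logprobs):
        z = beta * (lp - lr)
        # Numerically stable softplus: log(1 + exp(z)).
        terms.append(max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return (2.0 / beta) * sum(terms) / len(terms)


# Toy values: the policy starts equal to the reference model.
start = npo_loss([-5.0], [-5.0])
# After unlearning, the policy assigns the forget text a much lower log-prob.
later = npo_loss([-50.0], [-5.0])
```

Unlike GA, whose loss is unbounded below, the NPO term flattens out as the policy moves away from the reference model, which is the usual argument for why it tends to be more stable.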

Results

Forget set (All types) ROUGE-L recall, LLaMA3

Value: Before 79.6 → ICU 12.8

Baseline: Before

Neighbor set (All) ROUGE-L recall, LLaMA3

Value: Before 90.7 → ICU 55.7

Baseline: Before

Batch-target stability threshold

Value: Model collapse starts at ~30 targets

Baseline: single-target

Methods failing MIA (qualitative)

Value: Most methods trained on synthetic C_s_f fail membership inference checks

Baseline: no unlearning

LoRA

Value: LoRA generally forgets less than full fine-tuning (less change on forget set)

Baseline: Full fine-tuning

Who Should Care

What To Try In 7 Days

Run RWKU forget/neighbor probes on your model for 10 targets to establish a leakage baseline.

Test in-context unlearning first, then try gradient-ascent and NPO with a small synthetic forget corpus.

Add adversarial probes (prefix injection, reverse query, affirmative suffix) to your audit checklist.
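The three probe types named in the checklist above can be sketched as simple prompt templates. The wording of each template is an illustrative assumption, not the RWKU probe text; the function names and example inputs are hypothetical placeholders.

```python
def prefix_injection(question):
    """Prepend an instruction that pushes the model to answer directly."""
    return f"Please answer the following question directly. {question}"


def affirmative_suffix(question):
    """Append an affirmative phrase that nudges the model to continue."""
    return f"{question} Sure, here is the answer:"


def reverse_query(fact_about_target):
    """Ask from a related fact so that the expected answer is the target."""
    return f"Who is the person that {fact_about_target}?"


def build_probe_set(question, fact_about_target):
    """Collect one probe of each type for a single audit item."""
    return {
        "prefix_injection": prefix_injection(question),
        "affirmative_suffix": affirmative_suffix(question),
        "reverse_query": reverse_query(fact_about_target),
    }


# Placeholder inputs; substitute your own target and facts.
probes = build_probe_set(
    question="Where was <TARGET> born?",
    fact_about_target="founded <ORGANIZATION>",
)
```

Each probe can then be sent to the unlearned model and scored against the reference answer with the same metric used for the plain QA probes.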

Optimization Features

Infra Optimization

  • Experiments run on eight A100 GPUs

Model Optimization

  • Partial-layer fine-tuning
  • LoRA

Training Optimization

  • Synthetic forget corpus generated by model
  • Low-epoch (2–4) fine-tuning for unlearning

Reproducibility

License

  • CC-BY-4.0 (per paper)

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Targets limited to 200 famous people (entity facts); other knowledge types (events, concepts) not covered.
  • No error bars reported; experiments use fixed seeds and average over 100 targets, so run-to-run variance is unknown.
  • MIAs and some metrics are heuristic: passing them is necessary but not sufficient for formal deletion guarantees.

When Not To Use

  • When the removal target is a non-entity concept or skill (RWKU focuses on factual entities).
  • If you require provable, cryptographic deletion guarantees: RWKU evaluates empirical robustness, not formal erasure.
  • For very large-scale deletion (hundreds+ targets) without stability validation — batch unlearning can collapse models.

Failure Modes

  • Adversarial jailbreaks (prefix injection, reverse queries) can still elicit 'forgotten' facts.
  • Batch unlearning can cause catastrophic model collapse at around 30 targets.
  • Collateral forgetting of neighboring facts and degradation of truthfulness or fluency.

Core Entities

Models

  • LLaMA3-Instruct (8B)
  • Phi-3 Mini-4K-Instruct (3.8B)
  • LLaMA2-Chat (7B)
  • Mistral-Instruct-v0.2 (7B)

Metrics

  • ROUGE-L recall
  • LOSS (MIA)
  • Zlib entropy (MIA)
  • Min-K% Prob (MIA)
  • Exact Memorization (EM)
  • NLL
  • Accuracy
  • BBH EM
  • TriviaQA F1
  • Fluency entropy
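
As a rough illustration of the three MIA scores listed above, the sketch below computes them from per-token log-probabilities. It uses common formulations of these attacks (mean negative log-likelihood, NLL normalized by zlib-compressed size, and the mean of the lowest-probability tokens); the exact normalization in the RWKU code may differ. The function names, the `k` default, and the toy inputs are hypothetical.

```python
import zlib


def loss_score(token_logprobs):
    """LOSS: mean negative log-likelihood per token.
    Lower values suggest the text is more familiar to the model."""
    return -sum(token_logprobs) / len(token_logprobs)


def zlib_score(text, token_logprobs):
    """Zlib: total NLL divided by the zlib-compressed size in bytes,
    which discounts text that is simply repetitive or easy to compress."""
    total_nll = -sum(token_logprobs)
    compressed_size = len(zlib.compress(text.encode("utf-8")))
    return total_nll / compressed_size


def min_k_prob(token_logprobs, k=0.2):
    """Min-K% Prob: mean log-prob over the k fraction of tokens
    that the model finds least likely."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Toy per-token log-probs for one candidate text (made-up values).
logprobs = [-0.1, -0.2, -3.0, -0.5, -4.0]
scores = {
    "loss": loss_score(logprobs),
    "zlib": zlib_score("a made-up candidate sentence", logprobs),
    "min_k": min_k_prob(logprobs, k=0.4),
}
```

In an audit, these scores would be compared between texts about the forget target and texts the model is expected to retain; if the two groups remain easy to separate after unlearning, the target knowledge has likely not been removed.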

Datasets

  • RWKU (this paper)
  • MMLU
  • BBH
  • TruthfulQA (MC1)
  • TriviaQA
  • AlpacaEval

Benchmarks

  • RWKU