WMDP: a public 3,668-question benchmark plus RMU unlearning to measure and remove hazardous LLM knowledge

March 5, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

13

Authors

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Links

Abstract / PDF

Why It Matters For Business

WMDP + RMU let providers reduce hazardous knowledge in served models and demonstrate a practical mitigation that preserves most useful capabilities, lowering legal and reputational risk from malicious model use.

Summary TLDR

The authors release WMDP, a public benchmark of 3,668 multiple-choice questions that proxy hazardous knowledge across biosecurity, cybersecurity, and chemical security. They also propose RMU, a finetuning method that perturbs internal activations to 'unlearn' hazardous knowledge. RMU cuts model accuracy on WMDP from strong baselines to near-random on multiple open models while largely preserving MMLU and MT-Bench performance and resisting simple probes and an adversarial jailbreak. WMDP is filtered to remove especially sensitive items and is meant for closed-source unlearning and evaluation; it does not prevent relearning if model weights are released.

Problem Statement

There is no public, standard benchmark to measure hazardous knowledge in LLMs or to test methods that remove such knowledge. Private evaluations exist but are narrow and not reproducible. Model providers also lack validated tools to remove hazardous knowledge without breaking useful capabilities.

Main Contribution

WMDP dataset: 3,668 expert-written multiple-choice questions across biosecurity, cybersecurity, and chemical security; filtered to exclude especially sensitive and export-controlled content.

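RMU method: a representation-level finetuning loss with two terms. A forget term pushes internal activations on hazardous data toward a scaled random direction, inflating their norm; a retain term keeps activations on benign data close to those of the original model. Together they remove hazardous knowledge while limiting damage to general capability.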

Evaluation: RMU reduces WMDP accuracy to near-random on multiple open models while mostly preserving MMLU and MT-Bench and resisting linear probes and an adversarial jailbreak.

Open release: dataset, unlearning corpora, code, and instructions are publicly available at the project site.
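
The RMU loss described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: it assumes the forget term is a mean squared distance between hazardous-data activations and a fixed random unit vector scaled by a coefficient c, and that the retain term is the same distance between the updated and frozen models' activations on benign data. The function names, the toy activations, and the default values of c and alpha are placeholders chosen for illustration.

```python
import math
import random


def random_unit_vector(dim, seed=0):
    """Sample entries uniformly from [0, 1) and normalize to unit length."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_distance(acts, targets):
    """Average, over token rows, of the squared L2 distance to the target row."""
    total = 0.0
    for a, t in zip(acts, targets):
        total += sum((x - y) ** 2 for x, y in zip(a, t))
    return total / len(acts)


def rmu_loss(forget_acts, retain_acts, frozen_retain_acts, u, c=5.0, alpha=100.0):
    """Return (total, forget, retain) for one batch of per-token activations.

    c and alpha are hypothetical defaults, not values taken from the source.
    """
    control = [c * x for x in u]  # scaled random direction
    # Forget term: pull hazardous-data activations toward the control vector.
    forget = mean_squared_distance(forget_acts, [control] * len(forget_acts))
    # Retain term: penalize drift from the frozen model on benign data.
    retain = mean_squared_distance(retain_acts, frozen_retain_acts)
    return forget + alpha * retain, forget, retain


# Toy activations (two hazardous tokens, one benign token, hidden size 4).
u = random_unit_vector(4)
forget_acts = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.1, 0.2, -0.1]]
frozen = [[0.5, 0.5, -0.5, 0.1]]
total, forget, retain = rmu_loss(forget_acts, frozen, frozen, u)
```

In this toy call the updated and frozen benign activations are identical, so the retain term is zero and the whole loss comes from the forget term. In a real run the activations would come from a chosen layer of the model being finetuned and of a frozen copy, and only the former would receive gradient updates.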

Key Findings

WMDP is a sizable, vetted public benchmark for hazardous knowledge.

Numbers: 3,668 multiple-choice questions; development cost >$200K

RMU sharply reduces model QA accuracy on WMDP while keeping general knowledge.

Numbers: ZEPHYR-7B Bio: 63.7% → 31.2% (−32.5 pts); Cyber: 44.0% → 28.2% (−15.8 pts); MMLU: 58.1% → 57.1% (−1.0 pt)

RMU's reductions generalize across model scales.

Numbers: YI-34B Bio: 75.3% → 30.7% (−44.6 pts); MIXTRAL-8X7B Bio: 74.8% → 34.0% (−40.8 pts)

Unlearned knowledge is hard to recover with linear probes and an adversarial optimizer.

Numbers: Linear probes score only slightly above random; a GCG jailbreak extracted hazardous information from the base model in <50 steps but failed against the RMU model after 2,500+ steps

Unlearning can damage closely related, legitimate knowledge.

Numbers: Notable drops on related MMLU topics; virology and computer security saw large decreases

Results

Accuracy (ZEPHYR-7B, WMDP-Bio)

Value: 31.2%

Baseline: 63.7% (base)

Accuracy (ZEPHYR-7B, WMDP-Cyber)

Value: 28.2%

Baseline: 44.0% (base)

Accuracy (ZEPHYR-7B, MMLU)

Value: 57.1%

Baseline: 58.1% (base)

MT-Bench score (ZEPHYR-7B)

Value: 7.10

Baseline: 7.33 (base)

Accuracy (YI-34B, WMDP-Bio)

Value: 30.7%

Baseline: 75.3% (base)

Adversarial jailbreak steps to extract hazardous info

Value: >2,500 steps, attack failed (RMU)

Baseline: <50 steps, attack succeeded (base)
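
To make "near-random" concrete, the short script below recomputes the accuracy drops from the reported figures and adds the remaining gap above the 25% random baseline for four-choice questions. The numbers are copied from the results above; the `summarize` helper and its field names are illustrative, not part of the released code.

```python
RANDOM_BASELINE = 25.0  # chance accuracy on four-choice questions, in percent

# (model, benchmark, base accuracy %, accuracy % after RMU), as reported above
RESULTS = [
    ("ZEPHYR-7B", "WMDP-Bio", 63.7, 31.2),
    ("ZEPHYR-7B", "WMDP-Cyber", 44.0, 28.2),
    ("ZEPHYR-7B", "MMLU", 58.1, 57.1),
    ("YI-34B", "WMDP-Bio", 75.3, 30.7),
]


def summarize(rows, chance=RANDOM_BASELINE):
    """Compute the drop in points and the remaining gap above chance per row."""
    summary = []
    for model, bench, base, rmu in rows:
        summary.append({
            "model": model,
            "benchmark": bench,
            "drop_pts": round(base - rmu, 1),
            "above_chance_pts": round(rmu - chance, 1),
        })
    return summary


for row in summarize(RESULTS):
    print(f"{row['model']:<10} {row['benchmark']:<11} "
          f"drop {row['drop_pts']:>5.1f} pts, "
          f"{row['above_chance_pts']:>5.1f} pts above chance")
```

Read this way, the post-RMU WMDP scores sit roughly 3 to 6 points above chance, while the MMLU score stays more than 30 points above it.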

Who Should Care

What To Try In 7 Days

Run WMDP evaluations on your models to quantify hazardous knowledge exposure.

Test RMU on a small dev model and measure WMDP vs MMLU/MT-Bench trade-offs.

Audit training and question sources against WMDP to find high-risk content for removal or controlled access.
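
A first evaluation pass can be as small as the scoring loop sketched below. It is a hedged outline rather than the project's evaluation code: it assumes each question is a record with `question`, `choices`, and an integer `answer` index, and that the model is wrapped in a callable returning a single letter. The toy items are harmless placeholders, and `always_a` is a stub standing in for a real model.

```python
LETTERS = "ABCD"


def format_prompt(item):
    """Render one multiple-choice record as a lettered prompt."""
    lines = [item["question"]]
    for letter, choice in zip(LETTERS, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items, predict):
    """Percent of items where predict(prompt) returns the gold answer letter."""
    correct = sum(
        1 for item in items
        if predict(format_prompt(item)) == LETTERS[item["answer"]]
    )
    return 100.0 * correct / len(items)


# Placeholder records; a real run would load benchmark questions instead.
toy_items = [
    {"question": "What is 2 + 2?", "choices": ["4", "3", "5", "22"], "answer": 0},
    {"question": "Which is a prime?", "choices": ["4", "7", "9", "15"], "answer": 1},
    {"question": "What is 10 / 2?", "choices": ["5", "2", "20", "8"], "answer": 0},
    {"question": "What is 3 * 3?", "choices": ["6", "12", "3", "9"], "answer": 3},
]


def always_a(prompt):
    """Stub model that always answers 'A'."""
    return "A"


score = accuracy(toy_items, always_a)
```

Running the same loop on a hazardous-knowledge set and on a general-knowledge set, before and after unlearning, gives the two numbers needed to judge the trade-off: the first should fall toward the 25% random baseline while the second should stay close to its original value.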

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • WMDP uses multiple-choice format only; it may miss hazards revealed in open-ended, multi-step generation.
  • Strict filtering removed especially sensitive questions; benchmark is a proxy, not comprehensive coverage of all hazardous knowledge.
  • Unlearning can degrade closely related legitimate knowledge (e.g., virology, computer security).
  • RMU does not prevent relearning if model weights are publicly released and adversaries finetune.

When Not To Use

  • When full scientific capability in a domain is required for defensive or research use without structured access.
  • If model weights will be publicly released and you cannot control downstream finetuning.
  • As the only safety measure—WMDP should complement red-teaming and access controls.

Failure Modes

  • Over-unlearning: removing benign, defensive knowledge adjacent to hazardous topics.
  • Relearning: public release of weights lets attackers finetune to recover hazardous behavior.
  • Benchmark blind spots: WMDP may not detect multi-step planning or emergent capabilities.

Core Entities

Models

  • ZEPHYR-7B
  • YI-34B
  • MIXTRAL-8X7B
  • MISTRAL-7B
  • GPT-4

Metrics

  • Accuracy
  • MT-Bench score
  • random baseline (25%)

Datasets

  • WMDP
  • WMDP-Bio
  • WMDP-Cyber
  • WMDP-Chem
  • MMLU
  • MT-Bench
  • Wikitext

Benchmarks

  • WMDP
  • MMLU
  • MT-Bench