WMDP: a public 3,668-question benchmark plus RMU unlearning to measure and remove hazardous LLM knowledge

March 5, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The dataset fills a clear gap in public evaluation, and RMU shows repeatable reductions across multiple models. However, unlearning also degrades related benign knowledge and does not prevent relearning once weights are released, so deployment still needs policy and access controls.

Citations: 13

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WMDP + RMU let providers reduce hazardous knowledge in served models and demonstrate a practical mitigation that preserves most useful capabilities, lowering legal and reputational risk from malicious model use.

Summary TLDR

The authors release WMDP, a public benchmark of 3,668 multiple-choice questions that proxy hazardous knowledge across biosecurity, cybersecurity, and chemical security. They also propose RMU, a finetuning method that perturbs internal activations to 'unlearn' hazardous knowledge. RMU cuts model accuracy on WMDP from strong baselines to near-random on multiple open models while largely preserving MMLU and MT-Bench performance and resisting simple probes and an adversarial jailbreak. WMDP is filtered to remove especially sensitive items and is meant for closed-source unlearning and evaluation; it does not prevent relearning if model weights are released.

Problem Statement

There is no public, standard benchmark to measure hazardous knowledge in LLMs or to test methods that remove such knowledge. Private evaluations exist but are narrow and not reproducible. Model providers also lack validated tools to remove hazardous knowledge without breaking useful capabilities.

Main Contribution

WMDP dataset: 3,668 expert-written multiple-choice questions across biosecurity, cybersecurity, and chemical security; filtered to exclude export-controlled content.

RMU method: a representation-level finetuning loss that inflates activations on hazardous data and regularizes on benign data to remove hazardous knowledge.
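The two-part loss described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the forget term pulls activations on hazardous data toward a scaled random direction (which inflates their norm) and the retain term keeps activations on benign data close to those of a frozen copy of the model. The names `c` (scale), `alpha` (retain weight), and all function names are assumptions introduced here, and activations are represented as plain lists of floats rather than tensors.

```python
import math
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_unit_vector(dim, rng):
    """Random direction with entries drawn from [0, 1), scaled to unit norm."""
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def forget_loss(updated_acts, control):
    """Mean squared distance between per-token activations and the control vector."""
    return sum(squared_distance(h, control) for h in updated_acts) / len(updated_acts)

def retain_loss(updated_acts, frozen_acts):
    """Mean squared distance between updated and frozen per-token activations."""
    pairs = zip(updated_acts, frozen_acts)
    return sum(squared_distance(h, f) for h, f in pairs) / len(updated_acts)

def rmu_loss(forget_acts, control, retain_updated, retain_frozen, alpha):
    """Forget term plus alpha-weighted retain term (alpha is a hypothetical knob)."""
    return forget_loss(forget_acts, control) + alpha * retain_loss(retain_updated, retain_frozen)

if __name__ == "__main__":
    rng = random.Random(0)
    c = 6.5  # hypothetical scale for the control vector
    control = [c * x for x in random_unit_vector(4, rng)]
    hazardous = [[0.1, 0.2, 0.0, 0.3]]   # toy activations on forget data
    benign_new = [[1.0, 0.0, 0.5, 0.5]]  # toy activations, updated model
    benign_old = [[1.0, 0.0, 0.5, 0.4]]  # toy activations, frozen model
    print(rmu_loss(hazardous, control, benign_new, benign_old, alpha=100.0))
```

In a real training loop these terms would be computed on hidden states from a chosen layer and minimized by gradient descent; the sketch only shows how the two objectives combine.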

Key Findings

WMDP is a sizable, vetted public benchmark for hazardous knowledge.

Numbers: 3,668 multiple-choice questions; development cost >$200K

Practical Use: Use WMDP to measure hazardous knowledge and to benchmark unlearning methods before deployment.

Evidence Ref: Abstract; Section 3

RMU sharply reduces model QA accuracy on WMDP while keeping general knowledge.

Numbers: ZEPHYR-7B Bio: 63.7% → 31.2% (−32.5 pts); Cyber: 44.0% → 28.2% (−15.8 pts); MMLU: 58.1% → 57.1% (−1.0 pt)

Practical Use: Apply RMU to lower hazardous-question performance without heavily degrading many downstream tasks.

Evidence Ref: Table 1; Figure 8

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence / Evidence Ref
Accuracy | 31.2% | 63.7% (base) | −32.5 pts | WMDP-Bio | Table 1: ZEPHYR-7B + RMU
Accuracy | 28.2% | 44.0% (base) | −15.8 pts | WMDP-Cyber | Table 1: ZEPHYR-7B + RMU
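The deltas can be rechecked with a few lines of arithmetic. The sketch below uses only the ZEPHYR-7B figures quoted in this summary and the 25% random baseline listed under Metrics; the helper names are introduced here for illustration. It also reports how far each post-RMU WMDP score still sits above chance, which is one way to read "near-random".

```python
RANDOM_BASELINE = 25.0  # percent; the random baseline listed under Metrics

# (base accuracy %, accuracy after RMU %) as quoted for ZEPHYR-7B
RESULTS = {
    "WMDP-Bio": (63.7, 31.2),
    "WMDP-Cyber": (44.0, 28.2),
    "MMLU": (58.1, 57.1),
}

def delta_points(base, after):
    """Change in accuracy, in percentage points, rounded to one decimal."""
    return round(after - base, 1)

def margin_over_random(after, baseline=RANDOM_BASELINE):
    """Percentage points by which a score exceeds the random baseline."""
    return round(after - baseline, 1)

if __name__ == "__main__":
    for name, (base, after) in RESULTS.items():
        line = f"{name}: {base}% -> {after}% ({delta_points(base, after):+.1f} pts)"
        if name.startswith("WMDP"):
            line += f", {margin_over_random(after):.1f} pts above random"
        print(line)
```

Read this way, the hazardous-knowledge scores fall by double digits while MMLU moves by a single point.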

What To Try In 7 Days

Run WMDP evaluations on your models to quantify hazardous knowledge exposure.

Test RMU on a small dev model and measure WMDP vs MMLU/MT-Bench trade-offs.

Audit training and question sources against WMDP to find high-risk content for removal or controlled access.
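The first two steps come down to scoring a model on multiple-choice items and comparing the result with chance. The sketch below is one hedged way to structure that check. The item fields (`question`, `choices`, `answer`), the `predict` callable, and the toy items are all assumptions made for illustration; they are benign placeholders, not WMDP content, and in practice `predict` would wrap a call to your own model.

```python
RANDOM_BASELINE = 0.25  # chance accuracy on four-option multiple choice

def accuracy(items, predict):
    """Fraction of items where predict(question, choices) returns the answer index."""
    correct = sum(
        1 for item in items
        if predict(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)

def margin_over_random(score, baseline=RANDOM_BASELINE):
    """How far a score sits above the chance baseline."""
    return score - baseline

# Benign placeholder items with a hypothetical schema.
TOY_ITEMS = [
    {"question": "2 + 2 = ?", "choices": ["4", "3", "5", "6"], "answer": 0},
    {"question": "Capital of France?", "choices": ["Rome", "Paris", "Oslo", "Bern"], "answer": 1},
    {"question": "H2O is?", "choices": ["Salt", "Gold", "Water", "Iron"], "answer": 2},
    {"question": "Days in a week?", "choices": ["5", "6", "8", "7"], "answer": 3},
]

def always_first(question, choices):
    """Stand-in predictor that always picks option 0."""
    return 0

if __name__ == "__main__":
    score = accuracy(TOY_ITEMS, always_first)
    print(f"accuracy={score:.2f}, margin over random={margin_over_random(score):+.2f}")
```

Running the same function on a hazardous-knowledge split and on a general-capability split, before and after unlearning, gives the trade-off the second step asks for.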

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

WMDP uses multiple-choice format only; it may miss hazards revealed in open-ended, multi-step generation.

Strict filtering removed especially sensitive questions; benchmark is a proxy, not comprehensive coverage of all hazardous knowledge.

When Not To Use

When full scientific capability in a domain is required for defensive or research use without structured access.

If model weights will be publicly released and you cannot control downstream finetuning.

Failure Modes

Over-unlearning: removing benign, defensive knowledge adjacent to hazardous topics.

Relearning: public release of weights lets attackers finetune to recover hazardous behavior.

Core Entities

Models

ZEPHYR-7B, YI-34B, MIXTRAL-8X7B, MISTRAL-7B, GPT-4

Metrics

Accuracy, MT-Bench score, random baseline (25%)

Datasets

WMDP, WMDP-Bio, WMDP-Cyber, WMDP-Chem, MMLU, MT-Bench, Wikitext

Benchmarks

WMDP, MMLU, MT-Bench