WMDP: a public 3,668-question benchmark plus RMU unlearning to measure and remove hazardous LLM knowledge

Overview

Decision Snapshot: Needs Validation

The dataset fills a clear public gap, and RMU shows repeatable reductions on multiple models. However, unlearning can also degrade benign knowledge adjacent to the hazardous topics, and it does not stop relearning once weights are released, so deployment still needs policy and access controls.

Citations: 13

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WMDP + RMU let providers reduce hazardous knowledge in served models and demonstrate a practical mitigation that preserves most useful capabilities, lowering legal and reputational risk from malicious model use.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Founder

Summary TLDR

The authors release WMDP, a public benchmark of 3,668 multiple-choice questions that proxy hazardous knowledge across biosecurity, cybersecurity, and chemical security. They also propose RMU (Representation Misdirection for Unlearning), a finetuning method that perturbs internal activations to "unlearn" hazardous knowledge. RMU cuts accuracy on WMDP from strong baselines to near-random on multiple open models while largely preserving MMLU and MT-Bench performance, and the unlearned models resist simple probes and an adversarial jailbreak. WMDP is filtered to remove especially sensitive items and is meant for closed-source unlearning and evaluation; unlearning does not prevent relearning if model weights are released.

Problem Statement

There is no public, standard benchmark to measure hazardous knowledge in LLMs or to test methods that remove such knowledge. Private evaluations exist but are narrow and not reproducible. Model providers also lack validated tools to remove hazardous knowledge without breaking useful capabilities.

Main Contribution

WMDP dataset: 3,668 expert-written multiple-choice questions across biosecurity, cybersecurity, and chemical security, filtered to exclude export-controlled content.

RMU method: a representation-level finetuning loss that inflates activations on hazardous data and regularizes on benign data to remove hazardous knowledge.
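The two-part loss described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the "inflate" target is a fixed random unit vector `u` scaled by a constant `c`, that the regularizer is a squared distance to a frozen copy of the model, and that the two terms are combined with a weight `alpha`. The function name, the toy shapes, and the values of `c` and `alpha` are all hypothetical and chosen only for the example; plain arrays stand in for layer activations.

```python
import numpy as np

def rmu_style_loss(h_forget, h_retain, h_retain_frozen, control, alpha=100.0):
    """Sketch of a representation-level unlearning loss.

    h_forget:        activations of the model being tuned on hazardous text, (tokens, dim)
    h_retain:        activations of the model being tuned on benign text, (tokens, dim)
    h_retain_frozen: activations of a frozen copy on the same benign text
    control:         large fixed vector the hazardous activations are pushed toward
    alpha:           weight of the retain term (illustrative value, an assumption)
    """
    # Forget term: move activations on hazardous data toward the scaled control vector.
    forget = np.mean(np.sum((h_forget - control) ** 2, axis=-1))
    # Retain term: keep activations on benign data close to the frozen model's.
    retain = np.mean(np.sum((h_retain - h_retain_frozen) ** 2, axis=-1))
    return forget + alpha * retain, forget, retain

rng = np.random.default_rng(0)
hidden_dim = 8                      # toy size, not a real model dimension
u = rng.uniform(0.0, 1.0, size=hidden_dim)
u /= np.linalg.norm(u)              # random unit direction (assumption)
c = 5.0                             # illustrative scaling constant (assumption)
control = c * u

h_forget = rng.normal(size=(4, hidden_dim))
h_retain_frozen = rng.normal(size=(4, hidden_dim))
h_retain = h_retain_frozen.copy()   # before any update the two models agree

total, forget, retain = rmu_style_loss(h_forget, h_retain, h_retain_frozen, control)
print(f"forget={forget:.3f} retain={retain:.3f} total={total:.3f}")
```

In a real run the activations would come from a chosen layer of the model, and only the forget term would be nonzero at the first step, as in this toy setup.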

Key Findings

WMDP is a sizable, vetted public benchmark for hazardous knowledge.

Numbers: 3,668 multiple-choice questions; development cost >$200K

Practical Use: Use WMDP to measure hazardous knowledge and to benchmark unlearning methods before deployment.

Evidence Ref: Abstract; Section 3

RMU sharply reduces model QA accuracy on WMDP while keeping general knowledge.

Numbers: ZEPHYR-7B Bio: 63.7% → 31.2% (−32.5 pts); Cyber: 44.0% → 28.2% (−15.8 pts); MMLU: 58.1% → 57.1% (−1.0 pt)

Practical Use: Apply RMU to lower hazardous-question performance without heavily degrading many downstream tasks.

Evidence Ref: Table 1; Figure 8

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	31.2%	63.7% (base)	−32.5 pts	WMDP-Bio	Table 1: ZEPHYR-7B + RMU	—
Accuracy	28.2%	44.0% (base)	−15.8 pts	WMDP-Cyber	Table 1: ZEPHYR-7B + RMU	—
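As a quick arithmetic check on the reported ZEPHYR-7B numbers, the sketch below recomputes each delta and the remaining margin over the 25% random baseline listed under Metrics. The dictionary layout and the `summarize` helper are hypothetical conveniences; the accuracies are the ones quoted in this summary.

```python
RANDOM_BASELINE = 25.0  # chance accuracy on four-choice questions, in percent

# (base accuracy, accuracy after RMU) in percent, as quoted for ZEPHYR-7B
results = {
    "WMDP-Bio": (63.7, 31.2),
    "WMDP-Cyber": (44.0, 28.2),
    "MMLU": (58.1, 57.1),
}

def summarize(base, after):
    """Return (delta in points, points above the random baseline after RMU)."""
    delta = round(after - base, 1)
    above_chance = round(after - RANDOM_BASELINE, 1)
    return delta, above_chance

summary = {name: summarize(base, after) for name, (base, after) in results.items()}
for name, (delta, above_chance) in summary.items():
    print(f"{name}: delta={delta:+.1f} pts, {above_chance:.1f} pts above chance")
```

Read this way, the two WMDP splits end up within a few points of chance, while MMLU stays far above it.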

What To Try In 7 Days

Run WMDP evaluations on your models to quantify hazardous knowledge exposure.

Test RMU on a small dev model and measure WMDP vs MMLU/MT-Bench trade-offs.

Audit training and question sources against WMDP to find high-risk content for removal or controlled access.
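The first two steps come down to scoring multiple-choice questions before and after unlearning. A minimal sketch of that scoring loop follows. The record fields (`question`, `choices`, `answer`), the `score_choice` callback, and the toy questions are assumptions made for illustration; in practice the callback would return something like a model's log-likelihood for each answer option.

```python
def multiple_choice_accuracy(questions, score_choice):
    """Fraction of questions where the highest-scoring choice is the labeled answer.

    questions:    list of dicts with "question", "choices", and "answer" (index)
    score_choice: callable (question_text, choice_text) -> float, higher is better
    """
    correct = 0
    for q in questions:
        scores = [score_choice(q["question"], choice) for choice in q["choices"]]
        # Pick the index of the best-scoring choice (first one wins on ties).
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q["answer"])
    return correct / len(questions)

# Benign placeholder questions; real runs would load the benchmark instead.
toy_questions = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "22"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Rome", "Oslo", "Paris", "Bern"], "answer": 2},
]

# Hypothetical stand-in for a model: it only "knows" the first answer.
known_answers = {"2 + 2 = ?": "4"}

def toy_scorer(question, choice):
    return 1.0 if known_answers.get(question) == choice else 0.0

accuracy = multiple_choice_accuracy(toy_questions, toy_scorer)
print(f"accuracy = {accuracy:.2f}")
```

Running the same loop on WMDP and on a general benchmark such as MMLU, for both the base and the unlearned model, gives the trade-off comparison suggested in the second step.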

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://wmdp.ai https://arxiv.org/abs/2403.03218

Data URLs

https://wmdp.ai

Risks & Boundaries

Limitations

WMDP uses multiple-choice format only; it may miss hazards revealed in open-ended, multi-step generation.

Strict filtering removed especially sensitive questions; benchmark is a proxy, not comprehensive coverage of all hazardous knowledge.

When Not To Use

When full scientific capability in a domain is required for defensive or research use without structured access.

If model weights will be publicly released and you cannot control downstream finetuning.

Failure Modes

Over-unlearning: removing benign, defensive knowledge adjacent to hazardous topics.

Relearning: public release of weights lets attackers finetune to recover hazardous behavior.

Core Entities

Models

ZEPHYR-7B, YI-34B, MIXTRAL-8X7B, MISTRAL-7B, GPT-4

Metrics

Accuracy, MT-Bench score, random baseline (25%)

Datasets

WMDP, WMDP-Bio, WMDP-Cyber, WMDP-Chem, MMLU, MT-Bench, Wikitext

Benchmarks

WMDP, MMLU, MT-Bench

