Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
13
Why It Matters For Business
WMDP and RMU give providers a way to measure and reduce hazardous knowledge in served models while preserving most useful capabilities, which lowers legal and reputational risk from malicious model use.
Summary TLDR
The authors release WMDP, a public benchmark of 3,668 multiple-choice questions that proxy hazardous knowledge across biosecurity, cybersecurity, and chemical security. They also propose RMU, a finetuning method that perturbs internal activations to 'unlearn' hazardous knowledge. RMU cuts model accuracy on WMDP from strong baselines to near-random on multiple open models while largely preserving MMLU and MT-Bench performance and resisting simple probes and an adversarial jailbreak. WMDP is filtered to remove especially sensitive items and is meant for closed-source unlearning and evaluation; it does not prevent relearning if model weights are released.
Problem Statement
There is no public, standard benchmark to measure hazardous knowledge in LLMs or to test methods that remove such knowledge. Private evaluations exist but are narrow and not reproducible. Model providers also lack validated tools to remove hazardous knowledge without breaking useful capabilities.
Main Contribution
WMDP dataset: 3,668 expert-written multiple-choice questions across biosecurity, cybersecurity, and chemical security; filtered to exclude export-controlled content.
RMU method: a representation-level finetuning loss that inflates activations on hazardous data and regularizes on benign data to remove hazardous knowledge.
Evaluation: RMU reduces WMDP accuracy to near-random on multiple open models while mostly preserving MMLU and MT-Bench and resisting linear probes and an adversarial jailbreak.
Open release: dataset, unlearning corpora, code, and instructions are publicly available at the project site.
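The RMU description above (push activations on hazardous data away from their usual values, keep activations on benign data close to the original model) can be sketched as a two-term loss. This is a minimal illustration in plain Python, not the released implementation: the summary does not give the exact formula, so the random target direction `control`, the scaling factor `scale`, the retain weight `alpha`, and all function names are assumptions made for the example.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_unit_vector(dim, rng):
    """Random direction with unit norm (assumed target for hazardous activations)."""
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rmu_style_loss(forget_acts, retain_acts, retain_acts_frozen,
                   control, scale, alpha):
    """Forget term plus alpha-weighted retain term, each averaged over tokens.

    forget_acts:        activations of the model being tuned on hazardous text
    retain_acts:        activations of the model being tuned on benign text
    retain_acts_frozen: activations of the original, frozen model on the same benign text
    control, scale:     hazardous activations are pushed toward scale * control
    alpha:              weight of the benign-data regularizer
    """
    target = [scale * x for x in control]
    forget = sum(squared_distance(h, target) for h in forget_acts) / len(forget_acts)
    retain = sum(
        squared_distance(h, f) for h, f in zip(retain_acts, retain_acts_frozen)
    ) / len(retain_acts)
    return forget + alpha * retain
```

With a large `scale`, minimizing the first term inflates the norm of activations on hazardous inputs, while the second term penalizes any drift on benign inputs. In practice the activations would come from a chosen layer of the model and the loss would be minimized with gradient descent; both are outside this sketch.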
Key Findings
WMDP is a sizable, vetted public benchmark for hazardous knowledge.
RMU sharply reduces model QA accuracy on WMDP while keeping general knowledge.
RMU's reductions generalize across model scales.
Unlearned knowledge is hard to recover with linear probes and an adversarial optimizer.
Unlearning can damage closely related, legitimate knowledge.
Results
Accuracy
MT-Bench score (ZEPHYR-7B)
Adversarial jailbreak steps to extract hazardous info
What To Try In 7 Days
Run WMDP evaluations on your models to quantify hazardous knowledge exposure.
Test RMU on a small dev model and measure WMDP vs MMLU/MT-Bench trade-offs.
Audit training and question sources against WMDP to find high-risk content for removal or controlled access.
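The first two steps above reduce to scoring multiple-choice predictions and comparing scores before and after unlearning. The following is a minimal sketch, assuming predictions and answer keys are already available as lists of choice indices; the function names and the toy inputs in the usage note are hypothetical and are not results from the paper.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def tradeoff(before, after):
    """Per-benchmark score change (after minus before) for shared benchmark names."""
    return {name: after[name] - before[name] for name in before if name in after}
```

For example, `accuracy([0, 1, 2, 3], [0, 1, 2, 0])` scores three of four toy questions as correct. Passing two dictionaries of benchmark scores (say, one measured before RMU and one after) to `tradeoff` shows at a glance whether a drop on WMDP comes with a comparable drop on MMLU.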
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- WMDP uses multiple-choice format only; it may miss hazards revealed in open-ended, multi-step generation.
- Strict filtering removed especially sensitive questions; benchmark is a proxy, not comprehensive coverage of all hazardous knowledge.
- Unlearning can degrade closely related legitimate knowledge (e.g., virology, computer security).
- RMU does not prevent relearning if model weights are publicly released and adversaries finetune.
When Not To Use
- When full scientific capability in a domain is required for defensive or research use without structured access.
- If model weights will be publicly released and you cannot control downstream finetuning.
- As the only safety measure: WMDP should complement red-teaming and access controls.
Failure Modes
- Over-unlearning: removing benign, defensive knowledge adjacent to hazardous topics.
- Relearning: public release of weights lets attackers finetune to recover hazardous behavior.
- Benchmark blind spots: WMDP may not detect multi-step planning or emergent capabilities.
Core Entities
Models
- ZEPHYR-7B
- YI-34B
- MIXTRAL-8X7B
- MISTRAL-7B
- GPT-4
Metrics
- Accuracy
- MT-Bench score
- random baseline (25%)
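A 25% random baseline corresponds to guessing uniformly among four answer choices. One way to make "near-random" concrete is to ask whether a measured accuracy falls inside a normal-approximation band around chance. This is an illustrative check only, not a criterion taken from the paper; the function names and the `z = 1.96` default (roughly a 95% band) are assumptions.

```python
import math


def chance_band(num_questions, num_choices=4, z=1.96):
    """Normal-approximation band around chance accuracy for uniform guessing."""
    p = 1.0 / num_choices
    standard_error = math.sqrt(p * (1.0 - p) / num_questions)
    return p - z * standard_error, p + z * standard_error


def is_near_random(acc, num_questions, num_choices=4, z=1.96):
    """True if the accuracy is statistically indistinguishable from guessing."""
    low, high = chance_band(num_questions, num_choices, z)
    return low <= acc <= high
```

On the full 3,668-question benchmark the band is roughly 23.6% to 26.4%. The per-domain subsets are smaller, so their bands are wider.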
Datasets
- WMDP
- WMDP-Bio
- WMDP-Cyber
- WMDP-Chem
- MMLU
- MT-Bench
- Wikitext
Benchmarks
- WMDP
- MMLU
- MT-Bench

