Overview
The dataset fills a clear public gap, and RMU repeatably reduces WMDP accuracy on multiple models. However, unlearning also degrades knowledge of related topics and does not stop relearning from released weights, so deployment still needs policy and access controls.
Citations: 13
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
WMDP and RMU let providers measure and reduce hazardous knowledge in served models. RMU is a practical mitigation that preserves most useful capabilities, which lowers legal and reputational risk from malicious model use.
Who Should Care
Summary TLDR
The authors release WMDP, a public benchmark of 3,668 multiple-choice questions that proxy hazardous knowledge across biosecurity, cybersecurity, and chemical security. They also propose RMU, a finetuning method that perturbs internal activations to 'unlearn' hazardous knowledge. RMU cuts model accuracy on WMDP from strong baselines to near-random on multiple open models while largely preserving MMLU and MT-Bench performance and resisting simple probes and an adversarial jailbreak. WMDP is filtered to remove especially sensitive items and is meant for evaluating and unlearning hazardous knowledge in closed-source models; unlearning does not prevent relearning if model weights are released.
Problem Statement
There is no public, standard benchmark to measure hazardous knowledge in LLMs or to test methods that remove such knowledge. Private evaluations exist but are narrow and not reproducible. Model providers also lack validated tools to remove hazardous knowledge without breaking useful capabilities.
Main Contribution
WMDP dataset: 3,668 expert-written multiple-choice questions across biosecurity, cybersecurity, and chemical security, filtered to exclude export-controlled content.
RMU method: a representation-level finetuning loss that inflates activations on hazardous data and regularizes on benign data to remove hazardous knowledge.
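The description above says only that the loss inflates activations on hazardous data and regularizes on benign data. The following is a minimal sketch of one loss with that shape, not the authors' implementation. Everything specific in it is an assumption made for illustration: the scaled random control vector used as the target, the mean squared distance, the weight `alpha`, and all function names. Activations are plain lists of floats standing in for one layer's per-token hidden states.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_control_vector(dim, scale, seed=0):
    """Random unit vector multiplied by `scale` (hypothetical target for hazardous inputs)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in u))
    return [scale * x / norm for x in u]


def forget_loss(updated_acts, control):
    """Mean squared distance between hazardous-data activations and the control vector.

    Minimizing this pulls activations toward a large-norm target, i.e. inflates them.
    """
    return sum(squared_distance(h, control) for h in updated_acts) / len(updated_acts)


def retain_loss(updated_acts, frozen_acts):
    """Mean squared distance between updated and frozen activations on benign data."""
    pairs = zip(updated_acts, frozen_acts)
    return sum(squared_distance(h, f) for h, f in pairs) / len(updated_acts)


def combined_loss(forget_acts, control, retain_updated, retain_frozen, alpha):
    """Forget term plus `alpha` times the retain term (alpha is an assumed hyperparameter)."""
    return forget_loss(forget_acts, control) + alpha * retain_loss(retain_updated, retain_frozen)
```

In this sketch a larger `alpha` favors keeping benign activations close to the frozen model, and a larger `scale` sets a bigger target norm for hazardous inputs. How these values should be chosen is not stated in the summary above.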
Key Findings
WMDP is a sizable, vetted public benchmark for hazardous knowledge.
RMU cuts ZEPHYR-7B accuracy on WMDP-Bio from 63.7% to 31.2% and on WMDP-Cyber from 44.0% to 28.2%, while largely preserving MMLU and MT-Bench performance.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 31.2% | 63.7% (base) | −32.5 pts | WMDP-Bio | Table 1: ZEPHYR-7B + RMU | — |
| Accuracy | 28.2% | 44.0% (base) | −15.8 pts | WMDP-Cyber | Table 1: ZEPHYR-7B + RMU | — |
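The Delta column can be checked with a few lines of arithmetic. The sketch below recomputes each delta from the table's values and also reports how far the post-RMU accuracy sits above random guessing. The 25% chance level is an assumption (it presumes four answer options per question, which the summary above does not state), and the function and variable names are illustrative.

```python
# Assumption: four answer options per question, so random guessing scores 25%.
CHANCE_PCT = 25.0

# Values copied from the results table (ZEPHYR-7B, before and after RMU).
results = {
    "WMDP-Bio": {"base": 63.7, "rmu": 31.2},
    "WMDP-Cyber": {"base": 44.0, "rmu": 28.2},
}


def summarize(base, rmu, chance=CHANCE_PCT):
    """Return the accuracy change and the remaining margin above chance, in points."""
    return {
        "delta_pts": round(rmu - base, 1),
        "above_chance_pts": round(rmu - chance, 1),
    }


summary = {name: summarize(**row) for name, row in results.items()}

for name, row in summary.items():
    print(name, row)
```

Under that assumption, both post-RMU scores land within a few points of chance, which is what "near-random" refers to.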
What To Try In 7 Days
Run WMDP evaluations on your models to quantify hazardous knowledge exposure.
Test RMU on a small dev model and measure WMDP vs MMLU/MT-Bench trade-offs.
Audit training and question sources against WMDP to find high-risk content for removal or controlled access.
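The first two steps above amount to scoring multiple-choice accuracy before and after unlearning, then comparing the change on the hazardous benchmark against the change on a general benchmark. A minimal sketch of that bookkeeping follows. The record fields (`question`, `choices`, `answer`), the `predict` callable, and the toy placeholder questions are all hypothetical; a real run would load the benchmark files and call the model under test.

```python
def accuracy(questions, predict):
    """Percent of questions where predict(question, choices) returns the correct index."""
    correct = sum(
        1 for q in questions if predict(q["question"], q["choices"]) == q["answer"]
    )
    return 100.0 * correct / len(questions)


def tradeoff(before, after):
    """Per-benchmark accuracy change in points (after minus before)."""
    return {name: round(after[name] - before[name], 1) for name in before}


# Toy placeholder items, standing in for real benchmark records.
toy_questions = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "3 * 3 = ?", "choices": ["9", "6", "12", "8"], "answer": 0},
]


def always_first(question, choices):
    """Hypothetical stand-in for a model: always picks the first choice."""
    return 0


toy_score = accuracy(toy_questions, always_first)
```

A useful outcome is a large negative change on the hazardous benchmark paired with a change near zero on the general one, for example `tradeoff({"hazard": 60.0, "general": 58.0}, {"hazard": 30.0, "general": 57.0})`.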
Reproducibility
Data URLs
Risks & Boundaries
Limitations
WMDP uses multiple-choice format only; it may miss hazards revealed in open-ended, multi-step generation.
Strict filtering removed especially sensitive questions; benchmark is a proxy, not comprehensive coverage of all hazardous knowledge.
When Not To Use
When full scientific capability in a domain is required for defensive or research use without structured access.
If model weights will be publicly released and you cannot control downstream finetuning.
Failure Modes
Over-unlearning: removing benign, defensive knowledge adjacent to hazardous topics.
Relearning: public release of weights lets attackers finetune to recover hazardous behavior.

