Overview
The model is ready for deployment as an inference-time filter and shows consistent gains in F1 and jailbreak reduction across multiple benchmarks and ablations. Two caveats remain: much of the training data is synthetic, and the model still trails GPT‑4 on certain refusal tests.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third‑party moderation services.
Summary TLDR
WILDGUARD is an open-source, multi-task moderation system, released with its dataset (WILDGUARDMIX), for LLM safety. It classifies harmful prompts, harmful responses, and whether a model refused. WILDGUARDMIX contains 92K balanced items (87K train + 5K human-annotated test). Trained on the 87K training split, WILDGUARD attains state-of-the-art open-source performance, matches or slightly exceeds GPT-4 on some harmfulness tasks, and sharply reduces jailbreak success in a moderation pipeline (example: attack success falls from 79.8% to 2.4%). The project ships code and data for practical evaluation and deployment.
Problem Statement
Current open moderation tools miss many adversarial jailbreaking prompts and cannot reliably detect nuanced refusals in model outputs. This forces reliance on costly closed APIs and leaves safety gaps when testing or deploying LLMs.
Main Contribution
WILDGUARD: a unified, open moderator that labels prompt harmfulness, response harmfulness, and response refusal in one model.
WILDGUARDMIX: a large, balanced multi-task dataset with 92K labeled examples (WGTRAIN 86,759 train / WGTEST 5,299 human-annotated test) covering 13 risk subcategories and both vanilla and adversarial prompts.
Key Findings
WILDGUARD strongly improves refusal detection versus open baselines.
WILDGUARD matches or exceeds GPT-4 for some harmfulness tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Prompt harmfulness (avg F1 on public benchmarks) | 86.1% | GPT-4 84.6% | +1.5 pts | Public benchmarks (ToxicChat/OAI/Aegis/SimpST/HarmB) | Table 3 shows WILDGUARD 86.1 vs GPT-4 84.6 average F1 | Table 3, §4.2 |
| Prompt harmfulness (WGTEST total F1) | 88.9% | GPT-4 87.9% | +1.0 pts | WILDGUARDTEST | Table 4 WGTEST total prompt harmfulness F1 | Table 4, §4.1 |
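The F1 and delta columns can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes binary labels with "harmful" as the positive class, and the function names (`confusion_counts`, `f1_score`, `delta_points`) are hypothetical helpers, not part of the WILDGUARD release.

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[int, int, int]:
    """Return (tp, fp, fn), treating True ("harmful") as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def delta_points(model_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points, rounded to one decimal as in the table."""
    return round(model_pct - baseline_pct, 1)


# Toy labels (assumed data, for illustration only).
gold = [True, True, False, False, True]
pred = [True, False, True, False, True]
toy_f1 = f1_score(*confusion_counts(gold, pred))

# Delta for the WGTEST row above: 88.9 (WILDGUARD) vs 87.9 (GPT-4).
wgtest_delta = delta_points(88.9, 87.9)
```

Reporting deltas in percentage points rather than relative percent keeps them directly comparable across rows.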
What To Try In 7 Days
Run WILDGUARD as an inference-time filter in front of your assistant and measure ASR and benign RTA.
Score your model outputs with WILDGUARD to compare refusal and harm rates vs current tools (report F1 and ASR).
Evaluate WILDGUARD on a held-out set of adversarial prompts your team cares about and inspect false positives/negatives manually.
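A minimal sketch of the first step, assuming the moderator is exposed as a callable that returns the three labels (prompt harmfulness, response harmfulness, refusal). Every name here (`Verdict`, `filtered_reply`, `evaluate`, the toy stubs) is hypothetical and stands in for whatever interface your deployment uses; the keyword-based `toy_moderate` is a placeholder for a real WILDGUARD call, not its API. ASR is taken here as the share of harmful prompts that still yield a harmful reply, and benign RTA as the share of benign prompts that end in a refusal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

REFUSAL = "Sorry, I can't help with that."


@dataclass(frozen=True)
class Verdict:
    """The three labels the summary describes (hypothetical container)."""
    prompt_harmful: bool
    response_harmful: bool
    response_refusal: bool


Moderator = Callable[[str, Optional[str]], Verdict]
Generator = Callable[[str], str]


def filtered_reply(prompt: str, generate: Generator, moderate: Moderator) -> str:
    """Refuse if the moderator flags the prompt, or flags the generated response."""
    if moderate(prompt, None).prompt_harmful:
        return REFUSAL
    response = generate(prompt)
    if moderate(prompt, response).response_harmful:
        return REFUSAL
    return response


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def evaluate(
    harmful_prompts: Sequence[str],
    benign_prompts: Sequence[str],
    generate: Generator,
    moderate: Moderator,
    is_harmful_output: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute ASR on harmful prompts and refusal rate (RTA) on benign prompts."""
    harmful_replies = [filtered_reply(p, generate, moderate) for p in harmful_prompts]
    benign_replies = [filtered_reply(p, generate, moderate) for p in benign_prompts]
    return {
        "asr": rate([is_harmful_output(r) for r in harmful_replies]),
        "benign_rta": rate([r == REFUSAL for r in benign_replies]),
    }


# Toy stand-ins so the sketch runs end to end; replace with real model calls.
def toy_moderate(prompt: str, response: Optional[str]) -> Verdict:
    return Verdict(
        prompt_harmful="bomb" in prompt.lower(),
        response_harmful=response is not None and "step 1" in response.lower(),
        response_refusal=response == REFUSAL,
    )


def toy_generate(prompt: str) -> str:
    return "Step 1: ..." if "explosive" in prompt.lower() else f"Answer to: {prompt}"
```

Running `evaluate` once with the filter and once with a pass-through moderator gives a before/after comparison of ASR, while the benign RTA shows how much over-refusal the filter adds. The benign replies that came back as refusals are a natural starting list for the manual false-positive review.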
Risks & Boundaries
Limitations
Much of WGTRAIN is synthetic; real-world coverage is limited by in-the-wild sample size.
Refusal definitions and harm taxonomy are choices that may not align with every deployment's policy.
When Not To Use
If your moderation policy requires a different harm taxonomy than WILDGUARD's definitions.
When you need per-instance legal adjudication or human-only review for high-stakes decisions.
Failure Modes
False negatives on novel adversarial tactics not present in WGTRAIN.
False positives in refusal detection, where a response that complies but includes caveats is misclassified as a refusal.

