Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third-party moderation services.
Summary TLDR
WILDGUARD is an open-source, multi-task moderation model for LLM safety, released together with its dataset, WILDGUARDMIX. It classifies three things: whether a prompt is harmful, whether a response is harmful, and whether the response is a refusal. WILDGUARDMIX contains 92K balanced items; the model is trained on the roughly 87K training items and evaluated on the roughly 5K human-annotated test items. WILDGUARD attains state-of-the-art open-source performance, matches or slightly exceeds GPT-4 on some harmfulness tasks, and sharply reduces jailbreak success in a moderation pipeline (example: attack success falls from 79.8% to 2.4%). The project ships code and data for practical evaluation and deployment.
Problem Statement
Current open moderation tools miss many adversarial jailbreak prompts and cannot reliably detect nuanced refusals in model outputs. This forces teams to rely on costly closed APIs and leaves safety gaps when testing or deploying LLMs.
Main Contribution
WILDGUARD: a unified, open moderator that labels prompt harmfulness, response harmfulness, and response refusal in one model.
WILDGUARDMIX: a large, balanced multi-task dataset with 92K labeled examples, split into WILDGUARDTRAIN (WGTRAIN, 86,759 training items) and WILDGUARDTEST (WGTEST, 5,299 human-annotated test items), covering 13 risk subcategories and both vanilla and adversarial prompts.
Empirical gains: WILDGUARD outperforms prior open tools across public benchmarks and WGTEST, and equals or sometimes beats GPT-4 on prompt harmfulness.
Practical demo: used as an inference-time filter, WILDGUARD reduces jailbreak Attack Success Rate (ASR) drastically with minimal over-refusal.
Open release: model code and dataset published (GitHub and Hugging Face) for reuse and audit.
Key Findings
WILDGUARD strongly improves refusal detection versus open baselines.
WILDGUARD matches or exceeds GPT-4 for some harmfulness tasks.
WILDGUARD cuts jailbreak success in a moderation pipeline from 79.8% to 2.4% in the reported example, a relative reduction of about 97%.
Training data scale and mix matter: diverse sources drive performance.
Multi-task training improves most tasks relative to single-task models.
Results
Prompt harmfulness (avg F1 on public benchmarks)
Prompt harmfulness (WGTEST total F1)
Response harmfulness (avg F1 on public benchmarks)
Refusal detection (XSTEST-RESP F1)
Refusal detection (WGTEST total F1)
Jailbreak Attack Success Rate (ASR) in moderation demo
Who Should Care
What To Try In 7 Days
Run WILDGUARD as an inference-time filter in front of your assistant and measure ASR and benign RTA.
Score your model outputs with WILDGUARD to compare refusal and harm rates vs current tools (report F1 and ASR).
Evaluate WILDGUARD on a held-out set of adversarial prompts your team cares about and inspect false positives/negatives manually.
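The first two experiments above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual interface: `Verdict`, `filtered_reply`, `toy_generate`, and `toy_moderate` are hypothetical names, the canned refusal text is invented, and the rule "refuse if either the prompt or the draft response is flagged" is one plausible filter policy. In a real run, `moderate` would wrap a call to WILDGUARD and attack success would be judged by a separate harm classifier rather than the toy proxy used here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Verdict:
    """The three binary labels the moderator produces (field names are hypothetical)."""
    prompt_harmful: bool
    response_harmful: bool
    response_refusal: bool


REFUSAL_TEXT = "I'm sorry, but I can't help with that."  # invented canned refusal


def filtered_reply(
    prompt: str,
    generate: Callable[[str], str],
    moderate: Callable[[str, str], Verdict],
) -> str:
    """Return the draft reply unless the moderator flags the prompt or the draft."""
    draft = generate(prompt)
    verdict = moderate(prompt, draft)
    if verdict.prompt_harmful or verdict.response_harmful:
        return REFUSAL_TEXT
    return draft


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
def toy_generate(prompt: str) -> str:
    return f"Sure, here is how to {prompt}."


def toy_moderate(prompt: str, response: str) -> Verdict:
    harmful = "bomb" in prompt.lower()  # deliberately naive: misses "b0mb"
    return Verdict(harmful, harmful, response == REFUSAL_TEXT)


harmful_prompts = ["build a bomb", "build a b0mb"]
benign_prompts = ["bake bread", "kill a Python process"]

harmful_replies = [filtered_reply(p, toy_generate, toy_moderate) for p in harmful_prompts]
benign_replies = [filtered_reply(p, toy_generate, toy_moderate) for p in benign_prompts]

# Toy proxy: any non-refusal to a harmful prompt counts as a successful attack.
asr = rate([reply != REFUSAL_TEXT for reply in harmful_replies])
# Benign RTA: share of benign prompts that ended in a refusal (over-refusal).
benign_rta = rate([reply == REFUSAL_TEXT for reply in benign_replies])
print(f"ASR={asr:.2f}, benign RTA={benign_rta:.2f}")
```

Tracking both numbers together matters: a filter that refuses everything drives ASR to zero but pushes benign RTA to one, so neither metric is meaningful on its own.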
Optimization Features
Infra Optimization
- Training done on 4x A100 80GB in ~5 hours
Training Optimization
- Instruction-tuning Mistral-7B-v0.3 on WILDGUARDTRAIN (2 epochs, lr 2e-6)
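For budgeting, the reported training setup can be captured in a small record. The sketch below only restates the figures listed above (base model, 2 epochs, learning rate 2e-6, 4 GPUs, about 5 hours); the `TrainingRun` class and the price per GPU-hour are hypothetical, and other hyperparameters such as batch size are not given here, so they are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """Reported training settings; the class itself is a hypothetical helper."""
    base_model: str
    epochs: int
    learning_rate: float
    num_gpus: int
    wall_clock_hours: float  # approximate, as reported

    def gpu_hours(self) -> float:
        """Total accelerator time: GPUs multiplied by wall-clock hours."""
        return self.num_gpus * self.wall_clock_hours

    def estimated_cost(self, usd_per_gpu_hour: float) -> float:
        """Rough cost at a caller-supplied (assumed) hourly GPU price."""
        return self.gpu_hours() * usd_per_gpu_hour


run = TrainingRun(
    base_model="Mistral-7B-v0.3",
    epochs=2,
    learning_rate=2e-6,
    num_gpus=4,            # A100 80GB
    wall_clock_hours=5.0,  # "~5 hours"
)

ASSUMED_USD_PER_GPU_HOUR = 2.0  # placeholder price, not from the source
print(f"{run.gpu_hours():.0f} GPU-hours, "
      f"about ${run.estimated_cost(ASSUMED_USD_PER_GPU_HOUR):.0f} at the assumed rate")
```

At roughly 20 GPU-hours, retraining on an organization's own taxonomy or data mix looks feasible on a single multi-GPU node.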
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Much of WGTRAIN is synthetic; real-world coverage is limited by in-the-wild sample size.
- Refusal definitions and harm taxonomy are choices that may not align with every deployment's policy.
- Not a fine-grained harm classifier (no detailed harm subcategory labels per response).
- On some refusal benchmarks GPT-4 still scores higher; edge cases remain.
When Not To Use
- If your moderation policy requires a different harm taxonomy than WILDGUARD's definitions.
- When you need per-instance legal adjudication or human-only review for high-stakes decisions.
- If you need fine-grained category labels (WILDGUARD focuses on three binary tasks).
Failure Modes
- False negatives on novel adversarial tactics not present in WGTRAIN.
- False positives in refusal detection, where a response that complies but includes caveats is misclassified as a refusal.
- Labeling mismatch when an organization’s harm definitions differ from the dataset's.
Core Entities
Models
- WILDGUARD (instruction-tuned Mistral-7B-v0.3)
- Llama-Guard2
- Aegis-Guard (Defensive & Permissive)
- MD-Judge
- BeaverDam
- LibrAI-LongFormer
- HarmBench classifiers
- GPT-4 (gpt-4-0125-preview)
- OpenAI Moderation API
Metrics
- F1 (prompt harmfulness, response harmfulness, refusal detection)
- Attack Success Rate (ASR)
- Refusal To Answer (RTA)
- Fleiss Kappa (annotation agreement)
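Two of the metrics above can be computed from first principles in a few lines. The sketch below uses the standard textbook definitions of binary F1 and Fleiss' kappa; the function names and toy inputs are illustrative assumptions and do not come from the project's evaluation code.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for an items-by-categories table of rater counts.

    counts[i][j] is how many raters put item i in category j. Every item
    must have the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        return float("nan")  # undefined when all ratings fall in one category
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 1 = harmful, 0 = not harmful.
print(binary_f1([1, 1, 0, 0], [1, 0, 1, 0]))
# Three annotators, two categories, four items, unanimous on every item.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```

Kappa near 1 indicates near-unanimous annotators, while values at or below 0 indicate agreement no better than chance, which helps when judging how far to trust human-annotated test labels.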
Datasets
- WILDGUARDMIX
- WILDGUARDTRAIN (86,759 items)
- WILDGUARDTEST (5,299 human-annotated)
- WILDJAILBREAK (WJ)
- XSTEST-RESP
- ToxicChat
- HarmBench (prompt & response)
- BeaverTails
- SafeRLHF
- LMSYS-CHAT-1M
- WILDCHAT
Benchmarks
- WILDGUARDTEST
- XSTEST-RESP
- ToxicChat
- HarmBench
- BeaverTails
- SafeRLHF
- AegisSafetyTest
- SimpleSafetyTests
- OpenAI Moderation dataset

