WILDGUARD: open multi-task moderator that rivals GPT‑4 on harmfulness detection and cuts jailbreak success from 79.8% to 2.4%

June 26, 2024 | 9 min

Overview

Decision Snapshot: Ready For Pilot

The model is ready to pilot as an inference-time filter. It shows consistent gains in F1 and jailbreak reduction across multiple benchmarks and ablations, but much of its training data is synthetic and it still trails GPT‑4 on some refusal tests.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third‑party moderation services.

Who Should Care

Summary TLDR

WILDGUARD is an open-source, multi-task moderation system for LLM safety, released with its dataset, WILDGUARDMIX. It classifies three things: whether a prompt is harmful, whether a response is harmful, and whether the model refused. WILDGUARDMIX contains 92K labeled items (87K for training plus a 5K human-annotated test set). Trained on the 87K split, WILDGUARD attains state-of-the-art open-source performance, matches or slightly exceeds GPT-4 on some harmfulness tasks, and sharply reduces jailbreak success when used as a filter in a moderation pipeline (attack success falls from 79.8% to 2.4%). The project ships code and data for practical evaluation and deployment.

Problem Statement

Current open moderation tools miss many adversarial jailbreaking prompts and cannot reliably detect nuanced refusals in model outputs. This forces reliance on costly closed APIs and leaves safety gaps when testing or deploying LLMs.

Main Contribution

WILDGUARD: a unified, open moderator that labels prompt harmfulness, response harmfulness, and response refusal in one model.

WILDGUARDMIX: a large, balanced multi-task dataset with 92K labeled examples (WGTRAIN 86,759 train / WGTEST 5,299 human-annotated test) covering 13 risk subcategories and both vanilla and adversarial prompts.
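
To make the three-label output concrete, here is a minimal sketch of how a downstream service might parse it into a record. It assumes the moderator emits one `name: yes/no` line per task; that output format, the key strings, and the names `ModerationLabels` and `parse_moderation_output` are illustrative assumptions, not something this summary confirms.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModerationLabels:
    """One record per (prompt, response) pair; None means missing or unparseable."""
    prompt_harmful: Optional[bool]
    response_refusal: Optional[bool]
    response_harmful: Optional[bool]


# Assumed output keys (hypothetical); adjust to what your checkpoint actually emits.
_KEYS = {
    "harmful request": "prompt_harmful",
    "response refusal": "response_refusal",
    "harmful response": "response_harmful",
}


def parse_moderation_output(text: str) -> ModerationLabels:
    """Parse 'key: yes/no' lines; anything else leaves the field as None."""
    values: Dict[str, Optional[bool]] = {field: None for field in _KEYS.values()}
    for line in text.splitlines():
        key, sep, raw = line.partition(":")
        field = _KEYS.get(key.strip().lower())
        if not sep or field is None:
            continue
        answer = raw.strip().lower()
        if answer in ("yes", "no"):
            values[field] = answer == "yes"
    return ModerationLabels(**values)
```

Keeping unparseable fields as `None` rather than defaulting to `False` lets the caller decide whether to fail open or fail closed.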

Key Findings

WILDGUARD strongly improves refusal detection versus open baselines.

Numbers: Refusal F1 +26.4 pts vs LibrAI-LongFormer-ref on WGTEST/XSTEST-RESP

Practical Use: If you need accurate detection of whether a model refused, deploy WILDGUARD instead of existing open refusal classifiers to avoid under- or over-counting refusals.

Evidence Ref: Table 4, Table 20; §4.2

WILDGUARD matches or exceeds GPT-4 for some harmfulness tasks.

Numbers: Prompt harmfulness: +1.8 pts avg F1 on public benchmarks; +3.9 pts on adversarial subset vs GPT-4

Practical Use: You can use WILDGUARD as a lower-cost, open alternative to GPT-4 for prompt harmfulness evaluations, especially on adversarial inputs.

Evidence Ref: Table 3, §4.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Prompt harmfulness (avg F1 on public benchmarks) | 86.1% | GPT-4 84.6% | +1.8 pts | Public benchmarks (ToxicChat/OAI/Aegis/SimpST/HarmB) | Table 3 shows WILDGUARD 86.1 vs GPT-4 84.6 average F1 | Table 3, §4.2 |
| Prompt harmfulness (WGTEST total F1) | 88.9% | GPT-4 87.9% | +1.0 pts | WILDGUARDTEST | Table 4 WGTEST total prompt harmfulness F1 | Table 4, §4.1 |

What To Try In 7 Days

Run WILDGUARD as an inference-time filter in front of your assistant and measure attack success rate (ASR) on harmful prompts and refusal-to-answer rate (RTA) on benign prompts.

Score your model outputs with WILDGUARD to compare refusal and harm rates vs current tools (report F1 and ASR).

Evaluate WILDGUARD on a held-out set of adversarial prompts your team cares about and inspect false positives/negatives manually.
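
The first two steps above can be sketched as a thin wrapper plus two rate calculations. This is a minimal sketch under stated assumptions: the callables `assistant`, `prompt_is_harmful`, `response_is_harmful`, and the judge functions are hypothetical stand-ins for your own model and moderator calls, and `REFUSAL_TEXT` is a made-up canned reply.

```python
from typing import Callable, Iterable

REFUSAL_TEXT = "I'm sorry, but I can't help with that."  # hypothetical canned refusal


def guarded_reply(
    prompt: str,
    assistant: Callable[[str], str],
    prompt_is_harmful: Callable[[str], bool],
    response_is_harmful: Callable[[str, str], bool],
) -> str:
    """Return the assistant's reply unless the moderator flags the prompt or the reply."""
    if prompt_is_harmful(prompt):
        return REFUSAL_TEXT
    reply = assistant(prompt)
    if response_is_harmful(prompt, reply):
        return REFUSAL_TEXT
    return reply


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    items = list(flags)
    return sum(items) / len(items) if items else 0.0


def attack_success_rate(
    harmful_prompts: Iterable[str],
    reply_fn: Callable[[str], str],
    judge_harmful: Callable[[str, str], bool],
) -> float:
    """ASR: share of harmful prompts whose final reply is judged harmful."""
    return rate(judge_harmful(p, reply_fn(p)) for p in harmful_prompts)


def benign_refusal_rate(
    benign_prompts: Iterable[str],
    reply_fn: Callable[[str], str],
    judge_refusal: Callable[[str, str], bool],
) -> float:
    """RTA on benign prompts: share of harmless prompts that end in a refusal."""
    return rate(judge_refusal(p, reply_fn(p)) for p in benign_prompts)
```

Run both metrics twice, once with `reply_fn=assistant` and once with `reply_fn=lambda p: guarded_reply(p, assistant, ...)`, to see how much the filter lowers ASR and how much it raises benign refusals. For a fair ASR estimate, the judge should ideally be independent of the moderator used as the filter.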

Optimization Features

Infra Optimization: Training done on 4x A100 80GB in ~5 hours
Training Optimization: Instruction-tuning Mistral-7B-v0.3 on WILDGUARDTRAIN (2 epochs, lr 2e-6)
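
For budgeting a reproduction run, the reported setup can be captured in a small record. This is only a sketch: the values come from the lines above, while the class name `TrainingSetup` and its field names are hypothetical and do not correspond to any released config file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Reported training setup; field names are hypothetical."""
    base_model: str = "Mistral-7B-v0.3"
    train_set: str = "WILDGUARDTRAIN"
    epochs: int = 2
    learning_rate: float = 2e-6
    num_gpus: int = 4              # A100 80GB
    wall_clock_hours: float = 5.0  # approximate, reported as "~5 hours"

    @property
    def gpu_hours(self) -> float:
        """Approximate compute budget: GPUs times wall-clock hours."""
        return self.num_gpus * self.wall_clock_hours
```

With the reported figures, `TrainingSetup().gpu_hours` comes to roughly 20 A100 GPU-hours.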

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Much of WGTRAIN is synthetic; real-world coverage is limited by in-the-wild sample size.

Refusal definitions and harm taxonomy are choices that may not align with every deployment's policy.

When Not To Use

If your moderation policy requires a different harm taxonomy than WILDGUARD's definitions.

When you need per-instance legal adjudication or human-only review for high-stakes decisions.

Failure Modes

False negatives on novel adversarial tactics not present in WGTRAIN.

False positives where a compliant response that includes caveats is misclassified as a refusal.

Core Entities

Models

WILDGUARD (instruction-tuned Mistral-7B-v0.3)
Llama-Guard2
Aegis-Guard (Defensive & Permissive)
MD-Judge
BeaverDam
LibrAI-LongFormer
HarmBench classifiers
GPT-4 (gpt-4-0125-preview)
OpenAI Moderation API

Metrics

F1 (prompt harmfulness, response harmfulness, refusal detection)
Attack Success Rate (ASR)
Refusal To Answer (RTA)
Fleiss Kappa (annotation agreement)
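
For teams re-computing these metrics on their own data, binary F1 and Fleiss' kappa can be written in a few lines of standard-library Python. This is a minimal sketch using the textbook definitions; the function names are illustrative, and it is not the paper's evaluation code.

```python
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Binary F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from per-item category counts.

    counts[i][j] is how many raters put item i in category j;
    every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_cats)
    if p_expected == 1.0:
        return 1.0  # every rating fell in one category; agreement is trivially perfect
    return (p_bar - p_expected) / (1 - p_expected)
```

For example, `f1_score([1, 1, 0, 0], [1, 0, 1, 0])` has one true positive, one false positive, and one false negative, so it evaluates to 0.5.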

Datasets

WILDGUARDMIX
WILDGUARDTRAIN (86,759 items)
WILDGUARDTEST (5,299 human-annotated)
WILDJAILBREAK (WJ)
XSTEST-RESP
ToxicChat
HarmBench (prompt & response)
BeaverTails
SafeRLHF
LMSYS-CHAT-1M
WILDCHAT

Benchmarks

WILDGUARDTEST
XSTEST-RESP
ToxicChat
HarmBench
BeaverTails
SafeRLHF
AegisSafetyTest
SimpleSafetyTests
OpenAI Moderation dataset