WILDGUARD: open multi-task moderator that matches GPT‑4 and cuts jailbreak success to near zero

Overview

Decision SnapshotReady For Pilot

The model is ready for deployment as an inference-time filter and shows consistent gains in F1 and jailbreak reduction across multiple benchmarks and ablations; however, synthetic-data limits and some gaps vs GPT‑4 on certain refusal tests remain.

Citations3

Evidence Strength0.80

Confidence0.85

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third‑party moderation services.

Who Should Care

Product Manager ML Engineer CTO Engineering Lead Data Scientist

Summary TLDR

WILDGUARD is an open-source, multi-task moderation system and dataset (WILDGUARDMIX) for LLM safety. It classifies harmful prompts, harmful responses, and whether a model refused. Trained on 92K balanced items (87K train + 5K human-annotated test), WILDGUARD attains state-of-the-art open-source performance, matches or slightly exceeds GPT-4 on some harmfulness tasks, and sharply reduces jailbreak success in a moderation pipeline (example: attack success falls from 79.8% to 2.4%). The project ships code and data for practical evaluation and deployment.

Problem Statement

Current open moderation tools miss many adversarial jailbreaking prompts and cannot reliably detect nuanced refusals in model outputs. This forces reliance on costly closed APIs and leaves safety gaps when testing or deploying LLMs.

Main Contribution

WILDGUARD: a unified, open moderator that labels prompt harmfulness, response harmfulness, and response refusal in one model.

WILDGUARDMIX: a large, balanced multi-task dataset with 92K labeled examples (WGTRAIN 86,759 train / WGTEST 5,299 human-annotated test) covering 13 risk subcategories and both vanilla and adversarial prompts.

Key Findings

WILDGUARD strongly improves refusal detection versus open baselines.

NumbersRefusal F1 +26.4 pts vs LibrAI-LongFormer-ref on WGTEST/XSTEST-RESP

Practical UseIf you need accurate detection of whether a model refused, deploy WILDGUARD instead of existing open refusal classifiers to avoid under- or over-counting refusals.

Evidence RefTable 4, Table 20; §4.2

WILDGUARD matches or exceeds GPT-4 for some harmfulness tasks.

NumbersPrompt harmfulness: +1.8 pts avg F1 on public benchmarks; +3.9 pts on adversarial subset vs GPT-4

Practical UseYou can use WILDGUARD as a lower-cost, open alternative to GPT-4 for prompt harmfulness evaluations, especially on adversarial inputs.

Evidence RefTable 3, §4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Prompt harmfulness (avg F1 on public benchmarks)	86.1%	GPT-4 84.6%	+1.8 pts	Public benchmarks (ToxicChat/OAI/Aegis/SimpST/HarmB)	Table 3 shows WILDGUARD 86.1 vs GPT-4 84.6 average F1	Table 3, §4.2
Prompt harmfulness (WGTEST total F1)	88.9%	GPT-4 87.9%	+1.0 pts	WILDGUARDTEST	Table 4 WGTEST total prompt harmfulness F1	Table 4, §4.1

What To Try In 7 Days

Run WILDGUARD as an inference-time filter in front of your assistant and measure ASR and benign RTA.

Score your model outputs with WILDGUARD to compare refusal and harm rates vs current tools (report F1 and ASR).

Evaluate WILDGUARD on a held-out set of adversarial prompts your team cares about and inspect false positives/negatives manually.

Optimization Features

Infra Optimization

Training done on 4x A100 80GB in ~5 hours

Training Optimization

Instruction-tuning Mistral-7B-v0.3 on WILDGUARDTRAIN (2 epochs, lr 2e-6)

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/allenai/wildguard

Data URLs

https://huggingface.co/datasets/allenai/wildguardmix

Risks & Boundaries

Limitations

Much of WGTRAIN is synthetic; real-world coverage is limited by in-the-wild sample size.

Refusal definitions and harm taxonomy are choices that may not align with every deployment's policy.

When Not To Use

If your moderation policy requires a different harm taxonomy than WILDGUARD's definitions.

When you need per-instance legal adjudication or human-only review for high-stakes decisions.

Failure Modes

False negatives on novel adversarial tactics not present in WGTRAIN.

False positives where a nuanced compliance contains caveats and is misclassified as refusal.

Core Entities

Models

WILDGUARD (instruction-tuned Mistral-7B-v0.3)Llama-Guard2Aegis-Guard (Defensive & Permissive)MD-JudgeBeaverDamLibrAI-LongFormerHarmBench classifiersGPT-4 (gpt-4-0125-preview)OpenAI Moderation API

Metrics

F1 (prompt harmfulness, response harmfulness, refusal detection)Attack Success Rate (ASR)Refusal To Answer (RTA)Fleiss Kappa (annotation agreement)

Datasets

WILDGUARDMIXWILDGUARDTRAIN (86,759 items)WILDGUARDTEST (5,299 human-annotated)WILDJAILBREAK (WJ)XSTEST-RESPToxicChatHarmBench (prompt & response)BeaverTailsSafeRLHFLMSYS-CHAT-1MWILDCHAT

Benchmarks

WILDGUARDTESTXSTEST-RESPToxicChatHarmBenchBeaverTailsSafeRLHFAegisSafetyTestSimpleSafetyTestsOpenAI Moderation dataset

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

WILDGUARD strongly improves refusal detection versus open baselines.

WILDGUARD matches or exceeds GPT-4 for some harmfulness tasks.

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

Model judges reward ethics-based refusals; human users penalize them

Key finding

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

Key finding

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Key finding

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Key finding