WILDGUARD: open multi-task moderator that matches GPT‑4 and cuts jailbreak success to near zero

June 26, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

3

Authors

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

Why It Matters For Business

WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third‑party moderation services.

Summary TLDR

WILDGUARD is an open-source, multi-task moderation system for LLM safety, released together with its dataset, WILDGUARDMIX. It classifies three things: whether a prompt is harmful, whether a response is harmful, and whether the model refused. WILDGUARDMIX holds 92K balanced items (87K train + 5K human-annotated test). Trained on the 87K training split, WILDGUARD attains state-of-the-art open-source performance, matches or slightly exceeds GPT-4 on some harmfulness tasks, and sharply reduces jailbreak success in a moderation pipeline (example: attack success falls from 79.8% to 2.4%). The project ships code and data for practical evaluation and deployment.
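The three labels described above map naturally onto a small record type. The sketch below is illustrative only: it assumes the moderator emits one `name: yes/no` line per task, and the field names in `_FIELDS` are hypothetical. Check the released model card for the real output format before relying on it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModerationLabels:
    """One verdict per task; None means the label was missing or not applicable."""
    prompt_harmful: Optional[bool]
    response_refusal: Optional[bool]
    response_harmful: Optional[bool]


# Hypothetical line prefixes (an assumption, not taken from the source).
_FIELDS = {
    "harmful request": "prompt_harmful",
    "response refusal": "response_refusal",
    "harmful response": "response_harmful",
}


def parse_labels(text: str) -> ModerationLabels:
    """Parse 'name: yes/no' lines into three optional booleans."""
    values = {name: None for name in _FIELDS.values()}
    for line in text.splitlines():
        key, sep, answer = line.partition(":")
        field = _FIELDS.get(key.strip().lower())
        answer = answer.strip().lower()
        # Anything other than yes/no (for example "N/A") stays None.
        if sep and field and answer in ("yes", "no"):
            values[field] = answer == "yes"
    return ModerationLabels(**values)
```

Keeping unknown answers as `None` rather than `False` avoids silently treating a prompt-only check as a "safe response" verdict.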

Problem Statement

Current open moderation tools miss many adversarial jailbreaking prompts and cannot reliably detect nuanced refusals in model outputs. This forces reliance on costly closed APIs and leaves safety gaps when testing or deploying LLMs.

Main Contribution

WILDGUARD: a unified, open moderator that labels prompt harmfulness, response harmfulness, and response refusal in one model.

WILDGUARDMIX: a large, balanced multi-task dataset with 92K labeled examples, split into WILDGUARDTRAIN (WGTRAIN, 86,759 training items) and WILDGUARDTEST (WGTEST, 5,299 human-annotated test items), covering 13 risk subcategories and both vanilla and adversarial prompts.

Empirical gains: WILDGUARD outperforms prior open tools across public benchmarks and WGTEST, and equals or sometimes beats GPT-4 on prompt harmfulness.

Practical demo: used as an inference-time filter, WILDGUARD reduces jailbreak Attack Success Rate (ASR) drastically with minimal over-refusal.

Open release: model code and dataset published (GitHub and Hugging Face) for reuse and audit.
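The inference-time filter mentioned in the practical demo can be sketched as a thin wrapper around an assistant. This is one plausible arrangement, not the authors' exact pipeline: `generate`, `classify`, `Verdict`, and `REFUSAL_MESSAGE` are hypothetical names, and the classifier is passed in as a plain callable so any moderator can be plugged in.

```python
from typing import Callable, NamedTuple, Optional

REFUSAL_MESSAGE = "Sorry, I can't help with that."  # placeholder canned refusal


class Verdict(NamedTuple):
    prompt_harmful: bool
    response_harmful: bool


# classify(prompt, response) -> Verdict; response is None for a prompt-only check.
Classifier = Callable[[str, Optional[str]], Verdict]
Generator = Callable[[str], str]


def moderated_reply(prompt: str, generate: Generator, classify: Classifier) -> str:
    """Refuse if the prompt or the generated response is flagged as harmful."""
    # Step 1: screen the prompt before spending any generation compute.
    if classify(prompt, None).prompt_harmful:
        return REFUSAL_MESSAGE
    # Step 2: generate, then screen the response as a second line of defense.
    response = generate(prompt)
    if classify(prompt, response).response_harmful:
        return REFUSAL_MESSAGE
    return response
```

Checking both the prompt and the response means an adversarial prompt that slips past the first check can still be caught when the output itself is harmful.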

Key Findings

WILDGUARD strongly improves refusal detection versus open baselines.

Numbers: Refusal F1 +26.4 pts vs LibrAI-LongFormer-ref on WGTEST/XSTEST-RESP

WILDGUARD matches or exceeds GPT-4 for some harmfulness tasks.

Numbers: Prompt harmfulness: +1.8 pts avg F1 on public benchmarks; +3.9 pts on adversarial subset vs GPT-4

WILDGUARD cuts jailbreak success in a moderation pipeline to a small fraction.

Numbers: ASR reduced from 79.8% to 2.4% in Tulu-2 demo with WILDGUARD filter

Training data scale and mix matter: diverse sources drive performance.

Numbers: WILDGUARDMIX = 92K examples (86,759 train + 5,299 test); removing adversarial synthetic data drops adversarial prompt F1 by >8 pts.

Multi-task training helps most tasks vs single-task models.

Numbers: Multi-task WILDGUARD improves average F1 on public evaluations vs single-task in ablations

Results

Prompt harmfulness (avg F1 on public benchmarks)

Value: 86.1%

Baseline: GPT-4 84.6%

Prompt harmfulness (WGTEST total F1)

Value: 88.9%

Baseline: GPT-4 87.9%

Response harmfulness (avg F1 on public benchmarks)

Value: 82.4%

Baseline: GPT-4 82.0%

Refusal detection (XSTEST-RESP F1)

Value: 92.8%

Baseline: GPT-4 98.1%

Refusal detection (WGTEST total F1)

Value: 88.6%

Baseline: GPT-4 92.4%

Jailbreak Attack Success Rate (ASR) in moderation demo

Value: 2.4% (with WILDGUARD)

Baseline: 79.8% (no filter)
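Most of the results above are F1 scores on binary labels. For readers who want to reproduce such a number on their own data, here is a minimal sketch of the standard definition, F1 = 2TP / (2TP + FP + FN), treating `True` as the positive class (for example "harmful" or "refusal"). The function name and the choice to return 0.0 when there are no positives at all are assumptions made for illustration.

```python
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Binary F1 with True as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    # No true or predicted positives: F1 is undefined, reported here as 0.0.
    return 2 * tp / denom if denom else 0.0
```

For instance, with one true positive, one false positive, and one false negative, the score is 2 / (2 + 1 + 1) = 0.5.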

Who Should Care

What To Try In 7 Days

Run WILDGUARD as an inference-time filter in front of your assistant and measure ASR and benign RTA.

Score your model outputs with WILDGUARD to compare refusal and harm rates vs current tools (report F1 and ASR).

Evaluate WILDGUARD on a held-out set of adversarial prompts your team cares about and inspect false positives/negatives manually.
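The measurements suggested above can be tallied with a few lines of code once each test interaction has been labeled. The sketch below assumes ASR is the share of harmful prompts whose final output is still harmful, and RTA is the share of benign prompts that were refused; the `Outcome` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Outcome:
    prompt_is_harmful: bool    # ground-truth label of the test prompt
    response_is_harmful: bool  # final output (after filtering) judged harmful
    response_is_refusal: bool  # final output is a refusal


def asr_and_rta(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Return (ASR on harmful prompts, RTA on benign prompts) as fractions."""
    items = list(outcomes)
    harmful = [o for o in items if o.prompt_is_harmful]
    benign = [o for o in items if not o.prompt_is_harmful]
    # NaN signals that a rate is undefined because its subset is empty.
    asr = sum(o.response_is_harmful for o in harmful) / len(harmful) if harmful else float("nan")
    rta = sum(o.response_is_refusal for o in benign) / len(benign) if benign else float("nan")
    return asr, rta
```

Running this once without the filter and once with it gives a before/after pair comparable in spirit to the 79.8% and 2.4% figures, while the RTA on benign prompts shows how much over-refusal the filter adds.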

Optimization Features

Infra Optimization

  • Training done on 4x A100 80GB in ~5 hours

Training Optimization

  • Instruction-tuning Mistral-7B-v0.3 on WILDGUARDTRAIN (2 epochs, lr 2e-6)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Much of WGTRAIN is synthetic; real-world coverage is limited by in-the-wild sample size.
  • Refusal definitions and harm taxonomy are choices that may not align with every deployment's policy.
  • Not a fine-grained harm classifier (no detailed harm subcategory labels per response).
  • On some refusal benchmarks GPT‑4 still scores higher; edge cases remain.

When Not To Use

  • If your moderation policy requires a different harm taxonomy than WILDGUARD's definitions.
  • When you need per-instance legal adjudication or human-only review for high-stakes decisions.
  • If you need fine-grained category labels (WILDGUARD focuses on three binary tasks).

Failure Modes

  • False negatives on novel adversarial tactics not present in WGTRAIN.
  • False positives where a nuanced compliance contains caveats and is misclassified as refusal.
  • Labeling mismatch when an organization’s harm definitions differ from the dataset's.

Core Entities

Models

  • WILDGUARD (instruction-tuned Mistral-7B-v0.3)
  • Llama-Guard2
  • Aegis-Guard (Defensive & Permissive)
  • MD-Judge
  • BeaverDam
  • LibrAI-LongFormer
  • HarmBench classifiers
  • GPT-4 (gpt-4-0125-preview)
  • OpenAI Moderation API

Metrics

  • F1 (prompt harmfulness, response harmfulness, refusal detection)
  • Attack Success Rate (ASR)
  • Refusal To Answer (RTA)
  • Fleiss Kappa (annotation agreement)
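Fleiss' kappa, listed above as the annotation-agreement metric, can be computed from a table of per-item category counts. The following is a minimal sketch of the textbook formula, assuming every item was rated by the same number of annotators; the function name and input layout are choices made for this example and do not come from the source.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who put item i into category j."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean annotators agree less often than chance would predict.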

Datasets

  • WILDGUARDMIX
  • WILDGUARDTRAIN (86,759 items)
  • WILDGUARDTEST (5,299 human-annotated)
  • WILDJAILBREAK (WJ)
  • XSTEST-RESP
  • ToxicChat
  • HarmBench (prompt & response)
  • BeaverTails
  • SafeRLHF
  • LMSYS-CHAT-1M
  • WILDCHAT

Benchmarks

  • WILDGUARDTEST
  • XSTEST-RESP
  • ToxicChat
  • HarmBench
  • BeaverTails
  • SafeRLHF
  • AegisSafetyTest
  • SimpleSafetyTests
  • OpenAI Moderation dataset