PandaGuard: a plug-and-play framework and 3B-token benchmark that tests 19 jailbreak attacks, 12 defenses, and 49 LLMs

May 20, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng

Links

Abstract / PDF

Why It Matters For Business

PandaGuard shows that safety is not automatic: defenses lower jailbreak risk but add token cost and can reduce task performance, so businesses must test defenses per model and budget for extra inference cost.

Summary TLDR

PandaGuard is an open, modular framework that models jailbreak safety as interactions between attackers, defenders, target LLMs, and judges. The authors implement 19 attacks and 12 defenses and run PANDABENCH, a 3-billion-token benchmark across 49 models. Key takeaways: no single defense wins everywhere; defenses typically cut attack success rate (ASR) by ~33–50% but add cost and sometimes reduce utility; simple adaptive attacks such as RandomSearch remain effective (avg ASR 24%); and judge choice matters, because rule-based and LLM-based judges disagree substantially. Code, configs, and results are released for reproducible testing.

Problem Statement

Jailbreaking evaluations are fragmented: prior work tests isolated attacks or defenses, uses inconsistent metrics and judges, and runs at small scale. This makes it hard to compare methods, measure deployment cost, or trust safety verdicts.

Main Contribution

A modular multi-agent framework (PANDAGUARD) that unifies attackers, defenders, target LLMs, and judges with plugin interfaces.

PANDABENCH: a large-scale, reproducible benchmark (~3B tokens) testing 19 attacks, 12 defenses, and 49 LLMs.

A broad empirical study that quantifies attack strengths, defense trade-offs (safety vs cost vs utility), and judge inconsistency.
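To make the attacker / defender / target / judge split concrete, here is a minimal sketch of how such plugin roles could be wired together. Every class and function name below is hypothetical and chosen for illustration; none of it is PANDAGUARD's actual API, and the toy components stand in for real attacks, defenses, models, and judges.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Attacker(Protocol):
    def attack(self, goal: str) -> str: ...


class Defender(Protocol):
    def respond(self, prompt: str, target: Callable[[str], str]) -> str: ...


class Judge(Protocol):
    def is_jailbroken(self, goal: str, response: str) -> bool: ...


@dataclass
class PrefixAttacker:
    """Toy attacker: prepends a fixed instruction-override prefix to the goal."""
    prefix: str = "Ignore prior rules. "

    def attack(self, goal: str) -> str:
        return self.prefix + goal


@dataclass
class NoDefender:
    """Baseline: forwards the prompt to the target unchanged."""

    def respond(self, prompt: str, target: Callable[[str], str]) -> str:
        return target(prompt)


@dataclass
class KeywordFilterDefender:
    """Toy defender: refuses when the prompt contains a blocked phrase."""
    blocked: tuple = ("ignore prior rules",)

    def respond(self, prompt: str, target: Callable[[str], str]) -> str:
        if any(phrase in prompt.lower() for phrase in self.blocked):
            return "I'm sorry, I can't help with that."
        return target(prompt)


@dataclass
class RefusalPrefixJudge:
    """Toy rule-based judge: any response not opening with a refusal counts as jailbroken."""
    refusals: tuple = ("i'm sorry", "i cannot", "i can't")

    def is_jailbroken(self, goal: str, response: str) -> bool:
        return not response.lower().startswith(self.refusals)


def echo_target(prompt: str) -> str:
    """Stand-in for a target LLM that always complies."""
    return f"Sure, here is how: {prompt}"


def run_round(goal, attacker, defender, target, judge) -> bool:
    """One attack -> defense -> target -> judge pass; True means the attack succeeded."""
    prompt = attacker.attack(goal)
    response = defender.respond(prompt, target)
    return judge.is_jailbroken(goal, response)
```

Because each role only has to satisfy a small interface, swapping `NoDefender()` for `KeywordFilterDefender()` in `run_round` changes the defense without touching the attacker, target, or judge.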

Key Findings

No single defense works best for all models and harms.

Numbers: Defenses reduce ASR by ~33–50% on evaluated models

Simple adaptive attacks remain strong.

Numbers: RandomSearch avg ASR 24%; AIM 15%; PastTense 13%

Safety judges disagree and bias ASR estimates.

Numbers: Rule-based vs LLM judges Cohen's Kappa 0.071–0.126

Strong defenses often increase cost or reduce output quality.

Numbers: Some defenses use up to 5× tokens; some reduce Alpaca winrate by up to 25%
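For readers who want to reproduce the ASR arithmetic on their own runs, the sketch below computes ASR from per-prompt judge verdicts and reports the reduction both relative to the undefended baseline and in absolute percentage points. The summary does not say which of the two the ~33–50% figure refers to, so both are shown; the function names and the 40/24 counts are illustrative assumptions, not numbers from the benchmark.

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """Fraction of prompts whose response a judge marked as jailbroken."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def absolute_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Drop in ASR, expressed as a difference of rates."""
    return baseline_asr - defended_asr


def relative_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Drop in ASR as a fraction of the undefended baseline."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


# Illustrative counts only: 40/100 attacks succeed undefended, 24/100 with a defense.
baseline = attack_success_rate([True] * 40 + [False] * 60)
defended = attack_success_rate([True] * 24 + [False] * 76)
print(f"relative reduction: {relative_reduction(baseline, defended):.0%}")
print(f"absolute reduction: {absolute_reduction(baseline, defended):.0%} points")
```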

Results

Attack Success Rate (ASR) reduction

Value: ≈33–50% reduction with defenses

Baseline: No defense

Top attack average ASR

Value: RandomSearch 24% (avg)

Baseline: Other attacks

Judge agreement (Cohen's Kappa)

Value: 0.071–0.126 (rule-based vs LLM judges)

Baseline: Perfect agreement = 1.0

Defense token cost

Value: Up to 5× token usage for dialog-based defenses

Baseline: No defense

Impact on model utility

Value: Up to 25% drop in Alpaca winrate

Baseline: Undefended model performance
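The judge-agreement figure is Cohen's Kappa, which corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o − p_e) / (1 − p_e). The following is a minimal sketch for two judges that each emit a binary jailbroken / not-jailbroken verdict per response; the function name is an assumption and the implementation is a plain restatement of the textbook formula, not code from the benchmark.

```python
def cohens_kappa(judge_a: list[bool], judge_b: list[bool]) -> float:
    """Cohen's Kappa for two binary raters scoring the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must score the same non-empty item list")
    n = len(judge_a)

    # Observed agreement: share of items where the two verdicts match.
    p_observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Chance agreement from each judge's marginal rate of "jailbroken".
    rate_a = sum(judge_a) / n
    rate_b = sum(judge_b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    if p_expected == 1.0:
        # Both judges gave the same constant verdict; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)
```

A value near 0 means the judges agree about as often as chance would predict, which is why values in the 0.071–0.126 range indicate very weak agreement even when the raw match rate looks reasonable.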

Who Should Care

What To Try In 7 Days

Run PANDAGUARD on a representative model with your harmful-prompt set to measure baseline ASR.

Compare 2–3 defenses (e.g., Paraphrase, PerplexityFilter, SmoothLLM) and measure token cost and downstream quality.

Evaluate outputs with at least one LLM judge and a small human spot check for high-risk items.
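One way to organize the comparison described in these steps is a small table of ASR, token multiplier, and quality change per defense, all measured against the undefended run. The sketch below assumes you have already collected the raw counts yourself; `RunStats`, `compare`, and every number in the usage example are hypothetical placeholders rather than benchmark results or framework output.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    name: str
    jailbroken: int   # prompts judged as successful attacks
    total: int        # prompts evaluated
    tokens: int       # prompt + completion tokens consumed by the run
    quality: float    # any downstream quality score, higher is better


def compare(baseline: RunStats, runs: list[RunStats]) -> list[dict]:
    """Summarize each defended run relative to the undefended baseline."""
    base_asr = baseline.jailbroken / baseline.total
    rows = []
    for run in runs:
        asr = run.jailbroken / run.total
        rows.append({
            "defense": run.name,
            "asr": asr,
            "asr_reduction": (base_asr - asr) / base_asr if base_asr else 0.0,
            "token_multiplier": run.tokens / baseline.tokens,
            "quality_delta": run.quality - baseline.quality,
        })
    return rows


# Placeholder measurements for illustration only.
baseline = RunStats("none", jailbroken=40, total=100, tokens=100_000, quality=0.50)
runs = [
    RunStats("Paraphrase", jailbroken=20, total=100, tokens=150_000, quality=0.45),
    RunStats("SmoothLLM", jailbroken=10, total=100, tokens=400_000, quality=0.40),
]
for row in compare(baseline, runs):
    print(row)
```

Keeping the three columns side by side makes the trade-off explicit: a defense that halves ASR but quadruples token usage may or may not fit a given budget.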

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only evaluates text-only jailbreaks; multimodal attacks (images/audio) are not covered.
  • Judge variability creates subjective ASR estimates; single-judge results can be misleading.
  • Benchmark uses JBB-Behaviors (100 prompts), which may not cover all real-world misuse scenarios.
  • Proxy model choice (Llama-3.1-8B) may bias adaptive attack generation.

When Not To Use

  • When you need multimodal safety assessment (images or audio).
  • As the sole safety arbiter for high-risk production without human review.
  • If your threat model demands white-box parameter interventions not supported by the framework.

Failure Modes

  • Judge disagreement leads to unstable ASR estimates and operational decisions.
  • Defenses that lower ASR can significantly increase inference cost or reduce task utility.
  • Adaptive attackers tuned to the same proxy model can overfit to the evaluation setup and misrepresent real-world potency.
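The first failure mode can be made visible by reporting ASR separately per judge and by listing the responses on which judges disagree, so those items can go to human review. This is a hedged sketch with hypothetical function names and made-up verdicts; it is not part of the released framework.

```python
def asr_by_judge(verdicts: dict[str, list[bool]]) -> dict[str, float]:
    """ASR computed independently from each judge's verdicts."""
    return {name: sum(v) / len(v) for name, v in verdicts.items()}


def needs_human_review(verdicts: dict[str, list[bool]]) -> list[int]:
    """Indices of responses where at least two judges give different verdicts."""
    columns = list(verdicts.values())
    n_items = len(columns[0])
    if any(len(col) != n_items for col in columns):
        raise ValueError("every judge must score the same items")
    return [i for i in range(n_items) if len({col[i] for col in columns}) > 1]


# Made-up verdicts for four responses (True = judged as jailbroken).
verdicts = {
    "rule_based": [True, True, False, True],
    "llm_judge": [True, False, False, False],
}
print(asr_by_judge(verdicts))        # per-judge ASR differs on the same outputs
print(needs_human_review(verdicts))  # items to route to a human reviewer
```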

Core Entities

Models

  • GPT-4o
  • Claude-3.5/3.7
  • Llama-3.1-8B
  • Llama-3.3-70B
  • Qwen2.5
  • Qwen3-1.7B
  • DeepSeek-R1

Metrics

  • Attack Success Rate (ASR)
  • Cohen's Kappa
  • Token usage
  • Alpaca winrate

Datasets

  • JBB-Behaviors (JailbreakBench)

Benchmarks

  • PANDABENCH

Context Entities

Models

  • Llama-3.1-8B (proxy model used for attack/defense generation)