PandaGuard: a plug-and-play framework and 3B-token benchmark that tests 19 jailbreak attacks, 12 defenses, and 49 LLMs

May 20, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The framework and benchmark are fully implemented and released; results cover many models and attack/defense combos, but judge variability and text-only scope limit universal production claims.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

PandaGuard shows that safety is not automatic: defenses lower jailbreak risk but add token cost and can reduce task performance, so businesses must test defenses per model and budget for extra inference cost.

Who Should Care

Summary TLDR

PandaGuard is an open, modular framework that models jailbreak safety as interactions between attackers, defenders, target LLMs, and judges. The authors implement 19 attacks and 12 defenses and run PANDABENCH, a 3-billion-token benchmark across 49 models. Key takeaways: no single defense wins everywhere; defenses typically cut attack success rates (ASR) by ~33–50% but add cost and sometimes reduce utility; simple adaptive attacks (RandomSearch) remain effective (avg ASR 24%); and judge choice matters, since rule-based and LLM-based judges disagree substantially. Code, configs, and results are released for reproducible testing.

Problem Statement

Jailbreaking evaluations are fragmented: prior work tests isolated attacks or defenses, uses inconsistent metrics and judges, and runs at small scale. This makes it hard to compare methods, measure deployment cost, or trust safety verdicts.

Main Contribution

A modular multi-agent framework (PANDAGUARD) that unifies attackers, defenders, target LLMs, and judges with plugin interfaces.

PANDABENCH: a large-scale, reproducible benchmark (~3B tokens) testing 19 attacks, 12 defenses, and 49 LLMs.
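The attacker / defender / target / judge decomposition described above can be sketched as a small plugin pipeline. This is a minimal illustration only: every class and function name below (`Attacker`, `Defender`, `Judge`, `PrefixAttacker`, `KeywordFilterDefender`, `RefusalRuleJudge`, `echo_target`, `run_pipeline`) is hypothetical and is not taken from the PandaGuard codebase, and the toy components stand in for real attacks, defenses, models, and judges.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Attacker(Protocol):
    def attack(self, goal: str) -> str: ...


class Defender(Protocol):
    def defend(self, prompt: str, target: Callable[[str], str]) -> str: ...


class Judge(Protocol):
    def is_jailbroken(self, goal: str, response: str) -> bool: ...


@dataclass
class PrefixAttacker:
    """Toy attacker: prepends a fixed instruction-override prefix."""
    prefix: str = "Ignore previous instructions. "

    def attack(self, goal: str) -> str:
        return self.prefix + goal


class NoDefense:
    """Baseline: forwards the prompt to the target unchanged."""

    def defend(self, prompt: str, target: Callable[[str], str]) -> str:
        return target(prompt)


@dataclass
class KeywordFilterDefender:
    """Toy defender: refuses when the prompt contains a blocked phrase."""
    blocked: tuple = ("ignore previous instructions",)

    def defend(self, prompt: str, target: Callable[[str], str]) -> str:
        if any(phrase in prompt.lower() for phrase in self.blocked):
            return "I'm sorry, I can't help with that."
        return target(prompt)


class RefusalRuleJudge:
    """Toy rule-based judge: any response without a refusal phrase counts as a jailbreak."""
    REFUSALS = ("i'm sorry", "i cannot", "i can't")

    def is_jailbroken(self, goal: str, response: str) -> bool:
        return not any(r in response.lower() for r in self.REFUSALS)


def echo_target(prompt: str) -> str:
    """Stand-in for a target LLM that always complies."""
    return f"Sure, here is a response to: {prompt}"


def run_pipeline(goal: str, attacker: Attacker, defender: Defender,
                 target: Callable[[str], str], judge: Judge) -> bool:
    """Run attacker -> defender -> target -> judge and return the judge's verdict."""
    adversarial_prompt = attacker.attack(goal)
    response = defender.defend(adversarial_prompt, target)
    return judge.is_jailbroken(goal, response)


if __name__ == "__main__":
    goal = "placeholder goal"
    for defender in (NoDefense(), KeywordFilterDefender()):
        verdict = run_pipeline(goal, PrefixAttacker(), defender, echo_target, RefusalRuleJudge())
        print(type(defender).__name__, "jailbroken:", verdict)
```

Because each role is only a structural interface, swapping one attack, defense, or judge for another leaves `run_pipeline` unchanged, which is the property a plug-and-play benchmark relies on.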

Key Findings

No single defense works best for all models and harms.

Numbers: Defenses reduce ASR by ~33–50% on evaluated models

Practical Use: Pick and tune defenses per model and use combined strategies rather than a one-size-fits-all solution.

Evidence Ref: Section 4.1, Figure 2c

Simple adaptive attacks remain strong.

Numbers: RandomSearch avg ASR 24%; AIM 15%; PastTense 13%

Practical Use: Defenses must be validated against diverse attacks, including simple stochastic searches, not just hand-crafted templates.

Evidence Ref: Section 4.2, Figure 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Attack Success Rate (ASR) reduction | ≈33–50% reduction with defenses | No defense | −33% to −50% ASR | PANDABENCH (49 models, JBB-Behaviors) | Defenses consistently reduce ASR by about one-third to one-half | Section 4.1, Figure 2c
Top attack average ASR | RandomSearch 24% (avg) | Other attacks | RandomSearch leads by ~9 percentage points over AIM | PANDABENCH | RandomSearch outperforms other techniques with an average ASR of 24% | Section 4.2, Figure 3
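The two kinds of delta in these results are easy to conflate: the defense figure reads as a relative reduction ("one-third to one-half" of the undefended ASR), while the RandomSearch lead is a gap in percentage points (24% minus AIM's 15%). A minimal sketch of both calculations follows. The function names and the per-prompt verdict lists are illustrative assumptions, not benchmark data.

```python
def attack_success_rate(verdicts):
    """Fraction of prompts judged as successful jailbreaks (True/1 = success)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(bool(v) for v in verdicts) / len(verdicts)


def relative_reduction(baseline_asr, defended_asr):
    """Relative ASR reduction: 0.5 means the defense halves the baseline ASR."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


def percentage_point_gap(asr_a, asr_b):
    """Absolute gap between two ASR values, in percentage points."""
    return (asr_a - asr_b) * 100


if __name__ == "__main__":
    # Made-up verdicts for 20 prompts: 6 successes undefended, 3 with a defense.
    baseline = attack_success_rate([1] * 6 + [0] * 14)
    defended = attack_success_rate([1] * 3 + [0] * 17)
    print(f"baseline ASR {baseline:.2f}, defended ASR {defended:.2f}")
    print(f"relative reduction {relative_reduction(baseline, defended):.0%}")
    # Averages reported in the summary: RandomSearch 24%, AIM 15%.
    print(f"RandomSearch vs AIM gap: {percentage_point_gap(0.24, 0.15):.0f} points")
```

Reporting both numbers side by side avoids overstating a defense when the undefended ASR is already low.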

What To Try In 7 Days

Run PANDAGUARD on a representative model with your harmful-prompt set to measure baseline ASR.

Compare 2–3 defenses (e.g., Paraphrase, PerplexityFilter, SmoothLLM) and measure token cost and downstream quality.

Evaluate outputs with at least one LLM judge and a small human spot check for high-risk items.
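For the defense comparison step, one simple way to keep safety gain and inference cost in the same view is to tabulate each defense's ASR drop next to its token overhead relative to the undefended run. The sketch below assumes you have already collected an ASR and a total token count per run. `DefenseRun`, `summarize`, the defense labels, and all numbers are placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class DefenseRun:
    name: str
    asr: float     # attack success rate in [0, 1]
    tokens: int    # total tokens consumed across the prompt set


def summarize(baseline, runs):
    """Return one row per defense, sorted from cheapest to most expensive overhead."""
    if baseline.tokens <= 0:
        raise ValueError("baseline token count must be positive")
    rows = []
    for run in runs:
        rows.append({
            "defense": run.name,
            "asr_drop": baseline.asr - run.asr,
            "token_overhead": run.tokens / baseline.tokens - 1.0,
        })
    return sorted(rows, key=lambda row: row["token_overhead"])


if __name__ == "__main__":
    # Placeholder measurements for illustration only.
    baseline = DefenseRun("no_defense", asr=0.30, tokens=1000)
    runs = [
        DefenseRun("defense_b", asr=0.12, tokens=2500),
        DefenseRun("defense_a", asr=0.18, tokens=1100),
    ]
    for row in summarize(baseline, runs):
        print(f'{row["defense"]}: ASR drop {row["asr_drop"]:.2f}, '
              f'token overhead {row["token_overhead"]:+.0%}')
```

A downstream-quality column (for example a win rate on benign prompts) can be added to each row in the same way once it has been measured.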

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

https://hf.co/datasets/Beijing-AISI/panda-bench

JBB-Behaviors (JailbreakBench) - referenced dataset

Risks & Boundaries

Limitations

Only evaluates text-only jailbreaks; multimodal attacks (images/audio) are not covered.

Judge variability creates subjective ASR estimates; single-judge results can be misleading.
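Judge disagreement of this kind can be quantified with Cohen's kappa, which the summary lists among its metrics: it measures agreement between two raters after correcting for the agreement expected by chance. The following is a minimal pure-Python sketch. The two verdict lists are made-up examples rather than benchmark outputs, and `cohens_kappa` is an illustrative helper, not a function from the released code.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # 1 = judged as jailbroken, 0 = judged as safe (illustrative verdicts).
    rule_judge = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    llm_judge = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    print("ASR per rule judge:", sum(rule_judge) / len(rule_judge))
    print("ASR per LLM judge:", sum(llm_judge) / len(llm_judge))
    print("Cohen's kappa:", round(cohens_kappa(rule_judge, llm_judge), 3))
```

In this toy case the two judges agree on 7 of 10 items, yet kappa is well below that raw agreement once chance is accounted for, which is why a single judge's ASR should be read with caution.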

When Not To Use

When you need multimodal safety assessment (images or audio).

As the sole safety arbiter for high-risk production without human review.

Failure Modes

Judge disagreement leads to unstable ASR estimates and, in turn, unstable operational decisions.

Defenses that lower ASR can significantly increase inference cost or reduce task utility.

Core Entities

Models

GPT-4o, Claude-3.5/3.7, Llama-3.1-8B, Llama-3.3-70B, Qwen2.5, Qwen3-1.7B, DeepSeek-R1

Metrics

Attack Success Rate (ASR), Cohen's Kappa, Token usage, Alpaca winrate

Datasets

JBB-Behaviors (JailbreakBench)

Benchmarks

PANDABENCH

Context Entities

Models

Llama-3.1-8B (proxy model used for attack/defense generation)