PandaGuard: a plug-and-play framework and 3B-token benchmark that tests 19 jailbreak attacks, 12 defenses, and 49 LLMs

Overview

Decision Snapshot: Ready For Pilot

The framework and benchmark are fully implemented and released; results cover many models and attack/defense combos, but judge variability and text-only scope limit universal production claims.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

PandaGuard shows that safety is not automatic: defenses lower jailbreak risk but add token cost and can reduce task performance, so businesses must test defenses per model and budget for extra inference cost.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

PandaGuard is an open, modular framework that models jailbreak safety as interactions between attackers, defenders, target LLMs, and judges. The authors implement 19 attacks and 12 defenses and run PANDABENCH, a 3-billion-token benchmark across 49 models. Key takeaways: no single defense wins everywhere; defenses typically cut attack success rate (ASR) by ~33–50% but add token cost and sometimes reduce utility; simple adaptive attacks such as RandomSearch remain effective (avg ASR 24%); and judge choice matters, because rule-based and LLM-based judges disagree substantially. Code, configs, and results are released for reproducible testing.

Problem Statement

Jailbreaking evaluations are fragmented: prior work tests isolated attacks or defenses, uses inconsistent metrics and judges, and runs at small scale. This makes it hard to compare methods, measure deployment cost, or trust safety verdicts.

Main Contribution

A modular multi-agent framework (PANDAGUARD) that unifies attackers, defenders, target LLMs, and judges with plugin interfaces.

PANDABENCH: a large-scale, reproducible benchmark (~3B tokens) testing 19 attacks, 12 defenses, and 49 LLMs.
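The attacker / defender / target / judge decomposition can be pictured as four small interfaces wired into one pipeline. The sketch below is illustrative only: the names `Attacker`, `Defender`, `Target`, `Judge`, and `run_case`, as well as the toy implementations, are hypothetical and do not reflect PandaGuard's actual API. It also assumes an input-side defense; defenses that filter the model's output would hook in after the target instead.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


# Hypothetical plugin interfaces (not PandaGuard's real classes).
class Attacker(Protocol):
    def attack(self, goal: str) -> str: ...


class Defender(Protocol):
    def defend(self, prompt: str) -> str: ...


class Target(Protocol):
    def generate(self, prompt: str) -> str: ...


class Judge(Protocol):
    def is_jailbroken(self, goal: str, response: str) -> bool: ...


@dataclass
class PrefixAttacker:
    """Toy template attack: prepends a fixed prefix to the harmful goal."""
    prefix: str

    def attack(self, goal: str) -> str:
        return f"{self.prefix} {goal}"


@dataclass
class KeywordDefender:
    """Toy input filter: blocks prompts containing any listed phrase."""
    blocked: Sequence[str]

    def defend(self, prompt: str) -> str:
        lowered = prompt.lower()
        if any(phrase in lowered for phrase in self.blocked):
            return "[BLOCKED]"
        return prompt


class EchoTarget:
    """Stand-in for an LLM: refuses blocked prompts, otherwise complies."""

    def generate(self, prompt: str) -> str:
        if prompt == "[BLOCKED]":
            return "I'm sorry, I can't help with that."
        return f"Sure, here is: {prompt}"


class RefusalJudge:
    """Toy rule-based judge: no refusal prefix counts as a jailbreak."""

    def is_jailbroken(self, goal: str, response: str) -> bool:
        return not response.lower().startswith("i'm sorry")


def run_case(goal: str, attacker: Attacker, defender: Defender,
             target: Target, judge: Judge) -> bool:
    """Run one goal through attack -> defense -> target -> judge."""
    prompt = attacker.attack(goal)
    guarded = defender.defend(prompt)
    response = target.generate(guarded)
    return judge.is_jailbroken(goal, response)
```

Because each role is a separate interface, swapping one defense or judge for another changes a single argument to `run_case`, which is the property a plug-and-play benchmark relies on.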

Key Findings

No single defense works best for all models and harms.

Numbers: Defenses reduce ASR by ~33–50% on evaluated models

Practical Use: Pick and tune defenses per model and use combined strategies rather than a one-size-fits-all solution.

Evidence Ref: Section 4.1, Figure 2c

Simple adaptive attacks remain strong.

Numbers: RandomSearch avg ASR 24%; AIM 15%; PastTense 13%

Practical Use: Defenses must be validated against diverse attacks, including simple stochastic searches, not just hand-crafted templates.

Evidence Ref: Section 4.2, Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Attack Success Rate (ASR) reduction	≈33–50% reduction with defenses	No defense	−33% to −50% ASR	PANDABENCH (49 models, JBB-Behaviors)	Defenses consistently reduce ASR by about one-third to one-half	Section 4.1, Figure 2c
Top attack average ASR	RandomSearch 24% (avg)	other attacks	RandomSearch leads by ~9 percentage points over AIM	PANDABENCH	RandomSearch outperforms other techniques with an average ASR of 24%	Section 4.2, Figure 3
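The two rows above report two different kinds of delta: a reduction in ASR, read here as relative to the no-defense baseline (an interpretation of "one-third to one-half"), and a gap between two attacks in percentage points. A minimal sketch of both calculations follows. The helper names are hypothetical, and the 30/100 and 18/100 counts are made-up illustrative values, not benchmark results; only the 24% and 15% figures come from the table.

```python
def asr(successes: int, attempts: int) -> float:
    """Attack success rate: fraction of attempts judged as jailbreaks."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def relative_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Fraction of the baseline ASR removed by a defense."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


def point_gap(asr_a: float, asr_b: float) -> float:
    """Difference between two ASRs, in percentage points."""
    return (asr_a - asr_b) * 100


# Made-up counts for illustration only (not from the paper).
baseline = asr(30, 100)   # 30 of 100 prompts jailbroken without a defense
defended = asr(18, 100)   # 18 of 100 jailbroken with a defense
reduction = relative_reduction(baseline, defended)  # about 0.40, i.e. 40%

# Figures reported in the table: RandomSearch 24% vs AIM 15%.
lead = point_gap(0.24, 0.15)  # about 9 percentage points
```

Keeping the two quantities separate matters when comparing reports: a 40% relative reduction from a 30% baseline is only a 12-point drop.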

What To Try In 7 Days

Run PANDAGUARD on a representative model with your harmful-prompt set to measure baseline ASR.

Compare 2–3 defenses (e.g., Paraphrase, PerplexityFilter, SmoothLLM) and measure token cost and downstream quality.

Evaluate outputs with at least one LLM judge and a small human spot check for high-risk items.
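One way to organize the comparison in the steps above is to log one record per prompt and defense, then summarize ASR and token overhead against the no-defense run. The sketch below assumes such records have already been collected. The `Record` fields, the `"none"` baseline label, and the `summarize` helper are hypothetical and are not PandaGuard's output schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    defense: str        # hypothetical label, e.g. "none" or "SmoothLLM"
    jailbroken: bool    # verdict from whichever judge was used
    total_tokens: int   # prompt + completion tokens, incl. defense overhead


def summarize(records: Iterable[Record],
              baseline: str = "none") -> Dict[str, Dict[str, float]]:
    """Per-defense ASR, mean tokens, and token overhead vs the baseline."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        groups[rec.defense].append(rec)
    if baseline not in groups:
        raise ValueError(f"no records for baseline {baseline!r}")

    def mean_tokens(rs: List[Record]) -> float:
        return sum(r.total_tokens for r in rs) / len(rs)

    base_tokens = mean_tokens(groups[baseline])
    summary: Dict[str, Dict[str, float]] = {}
    for name, rs in groups.items():
        tokens = mean_tokens(rs)
        summary[name] = {
            "asr": sum(r.jailbroken for r in rs) / len(rs),
            "mean_tokens": tokens,
            # 0.5 means 50% more tokens per prompt than the baseline.
            "token_overhead": tokens / base_tokens - 1.0,
        }
    return summary
```

Reporting ASR and token overhead side by side makes the trade-off explicit: a defense that halves ASR but more than doubles tokens per request may not fit a fixed inference budget.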

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Beijing-AISI/panda-guard

https://hf.co/datasets/Beijing-AISI/panda-bench

https://panda-guard.github.io

Data URLs

https://hf.co/datasets/Beijing-AISI/panda-bench

JBB-Behaviors (JailbreakBench) - referenced dataset

Risks & Boundaries

Limitations

Only evaluates text-only jailbreaks; multimodal attacks (images/audio) are not covered.

Judge variability creates subjective ASR estimates; single-judge results can be misleading.
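Judge disagreement can be quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain two-rater, binary-label implementation. The two verdict lists are made up for illustration and are not PANDABENCH data.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary verdicts on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("verdict lists must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of items rater A marks as jailbroken
    p_b = sum(b) / n  # share of items rater B marks as jailbroken
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters gave the same constant label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up verdicts (True = judged as a successful jailbreak).
rule_judge = [True, True, True, False, False, False, True, False]
llm_judge = [True, False, True, False, False, True, True, False]

asr_rule = sum(rule_judge) / len(rule_judge)
asr_llm = sum(llm_judge) / len(llm_judge)
kappa = cohens_kappa(rule_judge, llm_judge)
```

In this toy case both judges report the same ASR (0.5) while kappa is only 0.5, so matching headline rates do not imply the judges agree on which responses are harmful.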

When Not To Use

When you need multimodal safety assessment (images or audio).

As the sole safety arbiter for high-risk production without human review.

Failure Modes

Judge disagreement leads to unstable ASR estimates and, in turn, unstable operational decisions.

Defenses that lower ASR can significantly increase inference cost or reduce task utility.

Core Entities

Models

GPT-4o, Claude-3.5/3.7, Llama-3.1-8B, Llama-3.3-70B, Qwen2.5, Qwen3-1.7B, DeepSeek-R1

Metrics

Attack Success Rate (ASR), Cohen's Kappa, Token usage, Alpaca winrate

Datasets

JBB-Behaviors (JailbreakBench)

Benchmarks

PANDABENCH

Context Entities

Models

Llama-3.1-8B (proxy model used for attack/defense generation)
