Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
PandaGuard shows that safety is not automatic: defenses lower jailbreak risk but add token cost and can reduce task performance, so businesses must test defenses per model and budget for extra inference cost.
Summary TLDR
PandaGuard is an open, modular framework that models jailbreak safety as interactions between attackers, defenders, target LLMs, and judges. The authors implement 19 attacks and 12 defenses and run PANDABENCH, a 3-billion-token benchmark across 49 models. Key takeaways: no single defense wins everywhere; defenses typically cut attack success rates by ~33–50% but add cost and sometimes reduce utility; simple adaptive attacks (RandomSearch) remain effective, with an average ASR of 24%; and judge choice matters, because rule-based and LLM-based judges disagree substantially. Code, configs, and results are released for reproducible testing.
Problem Statement
Jailbreaking evaluations are fragmented: prior work tests isolated attacks or defenses, uses inconsistent metrics and judges, and runs at small scale. This makes it hard to compare methods, measure deployment cost, or trust safety verdicts.
Main Contribution
- A modular multi-agent framework (PANDAGUARD) that unifies attackers, defenders, target LLMs, and judges with plugin interfaces.
- PANDABENCH: a large-scale, reproducible benchmark (~3B tokens) testing 19 attacks, 12 defenses, and 49 LLMs.
- A broad empirical study that quantifies attack strengths, defense trade-offs (safety vs cost vs utility), and judge inconsistency.
Key Findings
- No single defense works best for all models and harms.
- Simple adaptive attacks remain strong.
- Safety judges disagree and bias ASR estimates.
- Strong defenses often increase cost or reduce output quality.
Results
Attack Success Rate (ASR) reduction
Top attack average ASR
Judge agreement (Cohen's Kappa)
Defense token cost
Impact on model utility
Who Should Care
What To Try In 7 Days
- Run PANDAGUARD on a representative model with your harmful-prompt set to measure baseline ASR.
- Compare 2–3 defenses (e.g., Paraphrase, PerplexityFilter, SmoothLLM) and measure token cost and downstream quality.
- Evaluate outputs with at least one LLM judge and a small human spot check for high-risk items.
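The comparison in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not PandaGuard's API: the record fields (`defense`, `jailbroken`, `tokens`), the `summarize` helper, and every number in `records` are made-up placeholders, assuming you have already exported one judged result per prompt and defense.

```python
from collections import defaultdict

# Toy, invented records: one row per (prompt, defense) after judging.
# Field names are hypothetical and do not reflect PandaGuard's output schema.
records = [
    {"defense": "None", "jailbroken": True, "tokens": 220},
    {"defense": "None", "jailbroken": True, "tokens": 180},
    {"defense": "None", "jailbroken": False, "tokens": 200},
    {"defense": "None", "jailbroken": False, "tokens": 200},
    {"defense": "SmoothLLM", "jailbroken": True, "tokens": 900},
    {"defense": "SmoothLLM", "jailbroken": False, "tokens": 700},
    {"defense": "SmoothLLM", "jailbroken": False, "tokens": 800},
    {"defense": "SmoothLLM", "jailbroken": False, "tokens": 800},
]


def summarize(records, baseline="None"):
    """Return per-defense ASR, relative ASR reduction, and token overhead."""
    groups = defaultdict(list)
    for row in records:
        groups[row["defense"]].append(row)

    stats = {}
    for name, rows in groups.items():
        # ASR = fraction of prompts the judge marked as jailbroken.
        asr = sum(row["jailbroken"] for row in rows) / len(rows)
        mean_tokens = sum(row["tokens"] for row in rows) / len(rows)
        stats[name] = {"asr": asr, "mean_tokens": mean_tokens}

    base = stats[baseline]
    for entry in stats.values():
        # Relative ASR reduction versus the undefended baseline.
        entry["asr_reduction"] = (
            (base["asr"] - entry["asr"]) / base["asr"] if base["asr"] else 0.0
        )
        # Token overhead as a multiple of the baseline's mean token usage.
        entry["token_overhead"] = entry["mean_tokens"] / base["mean_tokens"]
    return stats


summary = summarize(records)
for name, entry in summary.items():
    print(name, entry)
```

Reporting ASR reduction next to token overhead keeps the safety gain and its inference cost in the same table, which is the trade-off the benchmark highlights.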
Reproducibility
Code Urls
Data Urls
- https://hf.co/datasets/Beijing-AISI/panda-bench
- JBB-Behaviors (JailbreakBench) - referenced dataset
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Only evaluates text-only jailbreaks; multimodal attacks (images/audio) are not covered.
- Judge variability creates subjective ASR estimates; single-judge results can be misleading.
- The benchmark uses JBB-Behaviors (100 prompts), which may not cover all real-world misuse scenarios.
- Proxy model choice (Llama-3.1-8B) may bias adaptive attack generation.
When Not To Use
- When you need multimodal safety assessment (images or audio).
- As the sole safety arbiter for high-risk production without human review.
- If your threat model demands white-box parameter interventions not supported by the framework.
Failure Modes
- Judge disagreement leads to unstable ASR estimates and unreliable operational decisions.
- Defenses that lower ASR can significantly increase inference cost or reduce task utility.
- Adaptive attackers tuned to the same proxy model can overfit to the evaluation setup and misrepresent real-world potency.
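One way to check for the judge-disagreement failure mode is to compute Cohen's Kappa between two judges' verdicts on the same responses. The sketch below uses the standard two-rater formula. The verdict lists are invented for illustration (1 = jailbroken, 0 = safe), and `cohens_kappa` is a hypothetical helper, not part of the framework.

```python
def cohens_kappa(a, b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    labels = set(a) | set(b)
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented verdicts from a rule-based judge and an LLM judge on 10 responses.
rule_judge = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
llm_judge = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

kappa = cohens_kappa(rule_judge, llm_judge)
asr_rule = sum(rule_judge) / len(rule_judge)
asr_llm = sum(llm_judge) / len(llm_judge)
print(f"kappa={kappa:.3f}, ASR(rule)={asr_rule:.2f}, ASR(LLM)={asr_llm:.2f}")
```

In this toy data the two judges report different ASR values for the same outputs, which is why a single-judge ASR figure is best read alongside an agreement score.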
Core Entities
Models
- GPT-4o
- Claude-3.5/3.7
- Llama-3.1-8B
- Llama-3.3-70B
- Qwen2.5
- Qwen3-1.7B
- DeepSeek-R1
Metrics
- Attack Success Rate (ASR)
- Cohen's Kappa
- Token usage
- Alpaca win rate
Datasets
- JBB-Behaviors (JailbreakBench)
Benchmarks
- PANDABENCH
Context Entities
Models
- Llama-3.1-8B (proxy model used for attack/defense generation)

