Overview
The framework and benchmark are fully implemented and released; results cover many models and attack/defense combos, but judge variability and text-only scope limit universal production claims.
Citations1
Evidence Strength0.80
Confidence0.85
Risk Signals10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
PandaGuard shows that safety is not automatic: defenses lower jailbreak risk but add token cost and can reduce task performance, so businesses must test defenses per model and budget for extra inference cost.
Who Should Care
Summary TLDR
PandaGuard is an open, modular framework that models jailbreak safety as interactions between attackers, defenders, target LLMs, and judges. The authors implement 19 attacks and 12 defenses and run PANDABENCH, a 3-billion-token benchmark across 49 models. Key takeaways: no single defense wins everywhere; defenses typically cut attack success rates by ~33–50% but add cost and sometimes reduce utility; simple adaptive attacks (RandomSearch) remain effective (avg ASR 24%); judge choice matters — rule-based and LLM-based judges disagree substantially. Code, configs, and results are released for reproducible testing.
Problem Statement
Jailbreaking evaluations are fragmented: prior work tests isolated attacks or defenses, uses inconsistent metrics and judges, and runs at small scale. This makes it hard to compare methods, measure deployment cost, or trust safety verdicts.
Main Contribution
A modular multi-agent framework (PANDAGUARD) that unifies attackers, defenders, target LLMs, and judges with plugin interfaces.
PANDABENCH: a large-scale, reproducible benchmark (~3B tokens) testing 19 attacks, 12 defenses, and 49 LLMs.
Key Findings
No single defense works best for all models and harms.
Simple adaptive attacks remain strong.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Attack Success Rate (ASR) reduction | ≈33–50% reduction with defenses | No defense | −33% to −50% ASR | PANDABENCH (49 models, JBB-Behaviors) | Defenses consistently reduce ASR by about one-third to one-half | Section 4.1, Figure 2c |
| Top attack average ASR | RandomSearch 24% (avg) | other attacks | RandomSearch leads by ~9 percentage points over AIM | PANDABENCH | RandomSearch outperforms other techniques with an average ASR of 24% | Section 4.2, Figure 3 |
What To Try In 7 Days
Run PANDAGUARD on a representative model with your harmful-prompt set to measure baseline ASR.
Compare 2–3 defenses (e.g., Paraphrase, PerplexityFilter, SmoothLLM) and measure token cost and downstream quality.
Evaluate outputs with at least one LLM judge and a small human spot check for high-risk items.
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
Only evaluates text-only jailbreaks; multimodal attacks (images/audio) are not covered.
Judge variability creates subjective ASR estimates; single-judge results can be misleading.
When Not To Use
When you need multimodal safety assessment (images or audio).
As the sole safety arbiter for high-risk production without human review.
Failure Modes
Judge disagreement leads to unstable ASR estimates and operation decisions.
Defenses that lower ASR can significantly increase inference cost or reduce task utility.

