Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Red teams and safety teams should verify that flagged jailbreaks are factually correct and executable. Otherwise they spend resources on false alarms and may overlook real risks.
Summary TLDR
Existing red-teaming metrics often treat any non-refusal as a successful jailbreak. The paper shows that many such "successes" are hallucinations: wrong, incoherent, or non-actionable outputs. It introduces BABYBLUE, a three-stage evaluation pipeline (classification, textual checks, functionality checks) with six evaluators (general, coherence, context, instruction, knowledge, toxicity) and an augmented dataset. In the authors' tests, BABYBLUE lowers false positives and cuts measured attack success rates (ASR) on many models (for example, the ASR of GCG against LLAMA2-7B drops from 0.51 to 0.09) while keeping recall similar. The code and evaluation method are shared to help teams focus on real threats rather than spurious alarms.
Problem Statement
Current jailbreak benchmarks often overstate risk because they count superficially malicious-looking outputs as successful attacks even when those outputs are hallucinations: factually wrong, irrelevant, or non-executable. An evaluation is needed that checks whether outputs are actually actionable and harmful.
Main Contribution
Diagnosis: classifies jailbreak hallucinations into four types: input-conflicting, context-conflicting, fact-conflicting, and logically incoherent.
BABYBLUE: a three-stage evaluation pipeline (classification → textual quality → functionality) with six dedicated evaluators.
An augmented dataset (100 added or modified samples) with reference knowledge and execution artifacts to support verification.
Empirical results on 28 models and 16 red‑teaming methods showing lower false positives and more conservative ASR estimates.
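The three-stage flow described above (classification, then textual quality, then functionality) can be sketched as a short-circuiting filter. This is a minimal illustration, not the BABYBLUE implementation: the `Completion` type, the `evaluate` function, the keyword-based `is_non_refusal`, and the repetition-based `is_coherent` are all hypothetical stand-ins for the paper's evaluators, which are assumed here to be far more capable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Completion:
    prompt: str
    text: str


# A check returns True when the completion passes it.
Check = Callable[[Completion], bool]

# Hypothetical refusal phrases; a real stage-1 classifier would be a trained model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")


def is_non_refusal(c: Completion) -> bool:
    """Stage 1 stand-in: treat the output as a candidate unless it looks like a refusal."""
    lowered = c.text.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)


def is_coherent(c: Completion) -> bool:
    """Stage 2 stand-in: reject very short or highly repetitive text."""
    words = c.text.lower().split()
    if len(words) < 5:
        return False
    return len(set(words)) / len(words) > 0.3


def evaluate(
    c: Completion,
    classify: Check,
    textual_checks: Sequence[Check],
    functional_checks: Sequence[Check],
) -> str:
    """Return the stage at which the completion was rejected, or 'validated'."""
    if not classify(c):
        return "refused"
    if not all(check(c) for check in textual_checks):
        return "failed_textual"
    if not all(check(c) for check in functional_checks):
        return "failed_functional"
    return "validated"
```

Under this sketch, only completions that come back as "validated" would count toward ASR; a metric that counts every non-refusal would also count the two "failed" outcomes.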
Key Findings
Many detected jailbreak successes are hallucinations or non-actionable outputs.
BABYBLUE substantially lowers measured attack success rates on evaluated models.
BABYBLUE reduces false positives while keeping detection (recall) similar.
Closed-source models tend to produce fewer hallucinations and more genuinely harmful completions under these tests.
Results
F1 score (expert-labelled sample)
Attack Success Rate (ASR) example: GCG against LLAMA2-7B, 0.51 → 0.09 under BABYBLUE
Who Should Care
What To Try In 7 Days
Run current red‑team outputs through a three-stage filter: alignment classifier, text-quality checks, then execution/feasibility tests.
Add a small set (50–200) of ground-truth reference examples and executable testcases for your top risky categories.
Measure F1, precision and ASR before/after validation to see if false positives drop as in the paper.
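The before/after measurement in the last step can be done with a few lines of plain Python. The sketch below uses the standard definitions of precision, recall, and F1, and takes ASR to be the fraction of attempts that the evaluator flags as successful. The function names and the toy labels are made up for illustration and are not results from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(pred: Sequence[bool], gold: Sequence[bool]) -> Tuple[float, float, float]:
    """Compare evaluator verdicts (pred) against expert labels (gold)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def attack_success_rate(pred: Sequence[bool]) -> float:
    """Fraction of attack attempts the evaluator counts as successful."""
    return sum(pred) / len(pred) if pred else 0.0


# Toy example (invented labels): 10 attack attempts, 2 truly harmful per experts.
gold = [True, True] + [False] * 8
# A non-refusal-only evaluator flags 5 attempts; a stricter one flags 3.
naive = [True] * 5 + [False] * 5
validated = [True] * 3 + [False] * 7

for name, pred in (("naive", naive), ("validated", validated)):
    p, r, f1 = precision_recall_f1(pred, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} ASR={attack_success_rate(pred):.2f}")
```

In this toy data the stricter evaluator removes two false positives, so precision and F1 rise and ASR falls while recall is unchanged. That is the pattern to look for when comparing your own numbers.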
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Fixed evaluator set may miss novel jailbreak strategies that exploit new modalities or chains of reasoning.
- Augmented dataset (100 samples added) may not represent all real-world attack variants.
- Knowledge evaluator relies on uncensored LLMs and expert references, which can still make mistakes.
When Not To Use
- When you need a fast, coarse estimate and cannot run sandbox or human checks.
- For open-ended behaviors lacking ground-truth references or executable targets.
- If you lack infrastructure for safe sandboxed execution of potentially harmful outputs.
Failure Modes
- Uncensored LLM evaluators themselves hallucinate and misjudge factuality.
- Execution environment mismatch causes functional checks to fail spuriously.
- Adversaries evolve prompts to bypass textual checks (e.g., subtle, contextual tricks).
Core Entities
Models
- LLAMA2-7B-CHAT
- LLAMA2-13B-CHAT
- Vicuna-7B
- Vicuna-13B
- Mistral-7B
- Mixtral-8x7B
- Baichuan-2
- Qwen
- Koala
- Orca-2
- GPT-3.5
- GPT-4
- Claude
Metrics
- Attack Success Rate (ASR)
- Recall
- Precision
- F1
Datasets
- HarmBench
- AdvBench
- BABYBLUE (augmented dataset)
Benchmarks
- HarmBench
- AdvBench
- BABYBLUE

