Many jailbreak detections are hallucinations — BABYBLUE validates which outputs are truly harmful

June 17, 20247 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng

Links

Abstract / PDF

Why It Matters For Business

Red teams and safety teams should verify that flagged jailbreaks are factually correct and executable; otherwise you waste resources on false alarms and miss real risks.

Summary TLDR

Existing red-teaming metrics often treat any non-refusal as a successful jailbreak. The paper shows many such "successes" are hallucinations — wrong, incoherent, or non-actionable outputs. It introduces BABYBLUE: a three-stage evaluation pipeline (classification, textual checks, functionality checks) plus six evaluators (general, coherence, context, instruction, knowledge, toxicity) and an augmented dataset. On their tests, BABYBLUE lowers false positives and cuts measured attack success rates (ASR) on many models (example: LLAMA2-7B GCG ASR 0.51→0.09) while keeping recall similar. The code and evaluation method are shared to help teams focus on real threats rather than spurious alarms.

Problem Statement

Current jailbreak benchmarks often overstate risk because they count superficially malicious-looking outputs as successful attacks even when those outputs are hallucinations — factually wrong, irrelevant or non-executable. We need an evaluation that checks whether outputs are actually actionable and harmful.

Main Contribution

Diagnosis: classifies jailbreak hallucinations into input-, context-, fact-conflicting and logical incoherence types.

BABYBLUE: a three-stage evaluation pipeline (classification → textual quality → functionality) with six dedicated evaluators.

A dataset augmentation (100 added/modified samples) and reference knowledge/execution artifacts for better verification.

Empirical results on 28 models and 16 red‑teaming methods showing lower false positives and more conservative ASR estimates.

Key Findings

Many detected jailbreak successes are hallucinations or non-actionable outputs.

NumbersF1: AdavBench 0.432 -> BABYBLUE 0.805 on sampled expert review

BABYBLUE substantially lowers measured attack success rates on evaluated models.

NumbersLLAMA2-7B-CHAT, GCG ASR 0.51 -> 0.09 (≈82% drop) on supplementary dataset

BABYBLUE reduces false positives while keeping detection (recall) similar.

NumbersRecall: HarmBench 0.778 vs BABYBLUE 0.756; Precision: 0.636 -> 0.861

Closed-source models tend to produce fewer hallucinations and more genuinely harmful completions under these tests.

Results

F1 score (expert-labelled sample)

ValueBABYBLUE 0.805

BaselineHarmBench 0.700; AdavBench 0.432

Attack Success Rate (ASR) example

ValueLLAMA2-7B-CHAT GCG: 0.09 (BABYBLUE)

Baseline0.51 (AdavBench)

Who Should Care

What To Try In 7 Days

Run current red‑team outputs through a three-stage filter: alignment classifier, text-quality checks, then execution/feasibility tests.

Add a small set (50–200) of ground-truth reference examples and executable testcases for your top risky categories.

Measure F1, precision and ASR before/after validation to see if false positives drop as in the paper.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Fixed evaluator set may miss novel jailbreak strategies that exploit new modalities or chains of reasoning.
  • Augmented dataset (100 samples added) may not represent all real-world attack variants.
  • Knowledge evaluator relies on uncensored LLMs and expert references, which can still make mistakes.

When Not To Use

  • When you need a fast, coarse estimate and cannot run sandbox or human checks.
  • For open-ended behaviors lacking ground-truth references or executable targets.
  • If you lack infrastructure for safe sandboxed execution of potentially harmful outputs.

Failure Modes

  • Uncensored LLM evaluators themselves hallucinate and misjudge factuality.
  • Execution environment mismatch causes functional checks to fail spuriously.
  • Adversaries evolve prompts to bypass textual checks (e.g., subtle, contextual tricks).

Core Entities

Models

  • LLAMA2-7B-CHAT
  • LLAMA2-13B-CHAT
  • Vicuna-7B
  • Vicuna-13B
  • Mistral-7B
  • Mixtral-8x7B
  • Baichuan-2
  • Qwen
  • Koala
  • Orca-2
  • GPT-3.5
  • GPT-4
  • Claude

Metrics

  • Attack Success Rate (ASR)
  • Recall
  • Precision
  • F1

Datasets

  • HarmBench
  • AdvBench
  • BABYBLUE (augmented dataset)

Benchmarks

  • HarmBench
  • AdvBench
  • BABYBLUE