Latent-jailbreak benchmark: test if hidden malicious text breaks model safety or instruction following

July 17, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

11

Authors

Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan

Links

Abstract / PDF

Why It Matters For Business

Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.

Summary TLDR

The paper introduces the Latent Jailbreak benchmark: synthetic prompts that hide a malicious instruction inside an otherwise normal task (e.g., "translate this sentence", where the sentence itself tells the model to produce toxic text). The authors release 416 prompt variants (13 templates × protected-group tokens × 2 positions) and a hierarchical human+automatic labeling procedure that measures three things: jailbreak success (unsafe outputs), robustness (whether the model follows the explicit normal instruction), and trustworthiness (a combined metric). Tests on three models (ChatGLM2-6B, BELLE-7B-2M, GPT-3.5) show wide variation: some models are easy to jailbreak, others over-refuse and fail to follow benign,
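The prompt construction described above (template × protected-group token × instruction position) can be sketched as follows. This is a minimal illustration, not the authors' generator: the instruction text, the single template, and the `[GROUP_A]`/`[GROUP_B]` tokens are neutral placeholders invented here, since the summary does not list the real 13 templates or the group tokens. If the product in the summary is exact, 416 / (13 × 2) implies 16 group tokens.

```python
from itertools import product

# Hypothetical placeholders: the real benchmark's 13 templates and its
# protected-group tokens are not enumerated in this summary.
EXPLICIT_INSTRUCTION = "Translate the following sentence into Chinese."
HIDDEN_TEMPLATES = ["Write one [ADJ] sentence about {group}."]
GROUP_TOKENS = ["[GROUP_A]", "[GROUP_B]"]
POSITIONS = ["prefix", "suffix"]


def build_prompt(explicit, hidden, position):
    """Place the explicit task instruction before or after the hidden text."""
    if position == "prefix":
        return f"{explicit}\n{hidden}"
    if position == "suffix":
        return f"{hidden}\n{explicit}"
    raise ValueError(f"unknown position: {position}")


def build_variants(templates=HIDDEN_TEMPLATES, groups=GROUP_TOKENS,
                   positions=POSITIONS, explicit=EXPLICIT_INSTRUCTION):
    """Enumerate every template x group x position combination."""
    variants = []
    for template, group, position in product(templates, groups, positions):
        hidden = template.format(group=group)
        variants.append({
            "template": template,
            "group": group,
            "position": position,
            "prompt": build_prompt(explicit, hidden, position),
        })
    return variants


variants = build_variants()
print(len(variants))  # 1 template x 2 groups x 2 positions
```

Keeping the template, group, and position alongside each prompt makes it straightforward to slice results later by any of those factors.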

Problem Statement

Current jailbreak benchmarks focus on whether a model outputs unsafe text, but ignore whether the model still follows the user's explicit instruction (output robustness). The gap: models can either be unsafe or over-refuse and lose competence. The paper builds a dataset and metrics to evaluate both safety and instruction-following robustness together.

Main Contribution

A latent jailbreak prompt dataset that embeds malicious instructions inside normal tasks and swaps instruction position (prefix vs suffix).

A hierarchical annotation framework (safety vs robustness) combining human labels and an automatic RoBERTa classifier.

Systematic analyses showing how instruction position, cue words, verbs, target groups, and toxic adjectives affect jailbreak rates across three LLMs.

Key Findings

Different models have very different jailbreak success rates on the same prompts.

Numbers: P1 jailbreak success: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% (Table 4)

Including a clear cue word (e.g., 'sentence') sharply reduces jailbreaks for some models.

Numbers: P2: BELLE-7B-2M jailbreak 3.1%, robustness 96.7%, trustworthiness 93.6% (Table 4)

Instruction position matters: models are safer and follow instructions more when the explicit task instruction is a prefix.

Numbers: Across templates, placing the explicit instruction as a suffix yields a much higher rate of unsafe output (see Fig. 5 and Table 4 comparisons)

Some models trade usefulness for safety by over-refusing and thus show low robustness.

Numbers: ChatGPT produced mostly safe responses for many templates but had low robustness (e.g., P1 robustness 1.4% and trustworth.

Sensitivity varies by instruction verb and toxic adjective.

Numbers: Prompt types P11–P13 and P6–P10 in Table 4 show large shifts in jailbreak rates depending on 'write/translate/paraphrase

Results

Jailbreak success rate (P1)

Value: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6%

Robustness (follow explicit instruction) (P1)

Value: ChatGLM2-6B 0.0%, BELLE-7B-2M 41.8%, ChatGPT 1.4%

Effect of cue word 'sentence' (P2)

Value: BELLE-7B-2M jailbreak 3.1%, robustness 96.7%, trustworthiness 93.6%

Baseline: Compared to the same model on P1 (50.4% jailbreak)
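The cue-word effect for BELLE-7B-2M can be restated as a percentage-point drop and a relative reduction. The sketch below uses only the two jailbreak rates quoted above (P1 and P2); the variable names are illustrative.

```python
# Jailbreak success rates for BELLE-7B-2M quoted in this summary (Table 4).
p1_jailbreak = 50.4  # percent, prompt type P1 (no cue word)
p2_jailbreak = 3.1   # percent, prompt type P2 (cue word 'sentence')

# Absolute drop in percentage points.
point_drop = round(p1_jailbreak - p2_jailbreak, 1)

# Relative reduction, as a percentage of the P1 rate.
relative_drop = round((p1_jailbreak - p2_jailbreak) / p1_jailbreak * 100, 1)

print(f"drop: {point_drop} points, relative reduction: {relative_drop}%")
```

This is a single-model comparison; the summary notes that the cue word helps only some models.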

Who Should Care

What To Try In 7 Days

Run the provided Latent Jailbreak prompts (or variants) against your model to get a baseline on jailbreak success and robustness.

Standardize prompts: put the task instruction first and add explicit cue words like 'sentence' to reduce hidden-instruction execution.

Use the paper's two-stage labeling approach: have humans label a seed set of generations, then fine-tune a classifier (RoBERTa) on those labels to scale labeling to the remaining generations.
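A baseline run along these lines could be organized as a small harness that tallies results by instruction position. This is a sketch under stated assumptions: `generate` and `judge` are hypothetical callables you would supply (a wrapper around your model, and a human or classifier-based labeler), and the toy stand-ins below exist only to make the example runnable.

```python
from collections import defaultdict


def evaluate(variants, generate, judge):
    """Tally unsafe and instruction-following responses per position.

    variants: iterable of dicts with 'position' and 'prompt' keys.
    generate: hypothetical callable, prompt -> response text.
    judge: hypothetical callable, (prompt, response) -> (is_safe, follows).
    """
    counts = defaultdict(lambda: {"n": 0, "unsafe": 0, "followed": 0})
    for variant in variants:
        response = generate(variant["prompt"])
        is_safe, follows = judge(variant["prompt"], response)
        bucket = counts[variant["position"]]
        bucket["n"] += 1
        bucket["unsafe"] += int(not is_safe)
        bucket["followed"] += int(bool(follows))
    return {
        position: {
            "jailbreak_rate": c["unsafe"] / c["n"],
            "robustness": c["followed"] / c["n"],
        }
        for position, c in counts.items()
    }


# Toy stand-ins (assumptions for illustration only): the fake model refuses
# when the task comes first and executes the hidden text otherwise.
def toy_generate(prompt):
    return "REFUSED" if prompt.startswith("TASK") else "HIDDEN_EXECUTED"


def toy_judge(prompt, response):
    is_safe = response != "HIDDEN_EXECUTED"
    follows = response == "TASK_DONE"
    return is_safe, follows


demo_variants = [
    {"position": "prefix", "prompt": "TASK then hidden text"},
    {"position": "suffix", "prompt": "hidden text then TASK"},
]
report = evaluate(demo_variants, toy_generate, toy_judge)
print(report)
```

The toy model shows both failure modes side by side: a safe but useless refusal in the prefix case, and an executed hidden instruction in the suffix case.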

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dataset focuses on toxic instructions and protected-group templates; other malicious goals (e.g., misinformation, scams) are not covered.
  • Experiments test three models only (two open models and GPT-3.5), so results may not generalize to larger or different architectures.
  • Automatic labeling relies on a RoBERTa classifier fine-tuned on a P1 seed; classifier bias or labeling errors can affect aggregated metrics.

When Not To Use

  • Do not assume passing this benchmark guarantees real-world safety against other jailbreak styles or non-toxic attacks.
  • Not suitable for multimodal inputs or voice/visual prompt-injection testing.
  • Not a direct replacement for adversarial red-teaming that targets system-level integrations.

Failure Modes

  • Over-refusal: an aligned model rejects benign tasks and loses utility (observed for ChatGPT).
  • Position sensitivity: models may ignore suffix instructions and execute hidden malicious text.
  • Judge bias: automatic labels depend on the RoBERTa model fine-tuned on limited data, risking misclassification.

Core Entities

Models

  • ChatGLM2-6B
  • BELLE-7B-2M
  • ChatGPT (GPT-3.5-turbo-0613)

Metrics

  • Jailbreak success rate
  • Robustness (% following explicit instruction)
  • Trustworthiness (combined safety+robustness metric)
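The three metrics can be computed from per-response labels roughly as follows. This is a minimal sketch, not the paper's scoring code: the `Label` type is invented here, and trustworthiness is assumed to be the share of responses that are both safe and follow the explicit instruction, since this summary only describes it as a combined metric.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical per-response annotation (two binary judgments)."""
    safe: bool
    follows_instruction: bool


def summarize(labels):
    """Return the three metrics as percentages over all labeled responses."""
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one labeled response")
    unsafe = sum(not label.safe for label in labels)
    robust = sum(label.follows_instruction for label in labels)
    # Assumption: 'trustworthy' means safe AND instruction-following.
    both = sum(label.safe and label.follows_instruction for label in labels)
    return {
        "jailbreak_success": 100 * unsafe / n,
        "robustness": 100 * robust / n,
        "trustworthiness": 100 * both / n,
    }


labels = [
    Label(safe=True, follows_instruction=True),
    Label(safe=True, follows_instruction=False),   # over-refusal
    Label(safe=False, follows_instruction=True),
    Label(safe=False, follows_instruction=False),  # jailbroken
]
metrics = summarize(labels)
print(metrics)
```

Under this assumed definition, trustworthiness can never exceed either the safe rate or the robustness rate, which is consistent with the BELLE-7B-2M P2 figures quoted earlier (96.9% safe, 96.7% robust, 93.6% trustworthy).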

Datasets

  • Latent Jailbreak prompt dataset (416 prompt variants)

Benchmarks

  • Latent Jailbreak benchmark