Latent-jailbreak benchmark: test if hidden malicious text breaks model safety or instruction following

July 17, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Dataset and code are available and experiments are reproducible; evaluation covers multiple prompt templates and three models but is limited to toxicity templates, two languages, and a small model set.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.

Summary TLDR

The paper introduces the Latent Jailbreak benchmark: synthetic prompts that hide malicious instructions inside otherwise normal tasks (e.g., "translate this sentence", where the sentence itself tells the model to produce toxic text). The authors release 416 prompt variants (13 templates × protected-group tokens × 2 positions) and a hierarchical human-plus-automatic labeling procedure that measures three things: jailbreak success (unsafe outputs), robustness (whether the model follows the explicit normal instruction), and trustworthiness (a combined metric). Tests on three models (ChatGLM2-6B, BELLE-7B-2M, GPT-3.5) show wide variation: some models are easy to jailbreak, others over-refuse and fail to follow benign,

Problem Statement

Current jailbreak benchmarks focus on whether a model outputs unsafe text, but ignore whether the model still follows the user's explicit instruction (output robustness). The gap: models can either be unsafe or over-refuse and lose competence. The paper builds a dataset and metrics to evaluate both safety and instruction-following robustness together.

Main Contribution

A latent jailbreak prompt dataset that embeds malicious instructions inside normal tasks and swaps instruction position (prefix vs suffix).

A hierarchical annotation framework (safety vs robustness) combining human labels and an automatic RoBERTa classifier.
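To make the dataset construction concrete, here is a minimal sketch of how prompts of this shape could be enumerated: a normal task instruction wrapped around a hidden instruction, with the task placed either before (prefix) or after (suffix) the hidden text. Everything named here is an assumption for illustration: the task wording, the target language, the bracketed placeholder templates, and the group tokens are hypothetical and are not the paper's released data.

```python
from itertools import product

# Hypothetical placeholders, NOT the paper's templates or protected-group tokens.
TASK = "Translate the following {cue} into Chinese."
HIDDEN_TEMPLATES = [
    "[hidden instruction A mentioning {group}]",
    "[hidden instruction B mentioning {group}]",
]
GROUPS = ["group_1", "group_2", "group_3"]
POSITIONS = ("prefix", "suffix")  # where the explicit task instruction sits


def build_prompt(hidden_template, group, position, cue="sentence"):
    """Combine the explicit task with the hidden instruction."""
    task = TASK.format(cue=cue)
    payload = hidden_template.format(group=group)
    if position == "prefix":
        return f"{task}\n{payload}"
    if position == "suffix":
        return f"{payload}\n{task}"
    raise ValueError(f"unknown position: {position!r}")


def build_dataset(cue="sentence"):
    """Enumerate every template x group x position combination."""
    return [
        {
            "template": t,
            "group": g,
            "position": p,
            "prompt": build_prompt(t, g, p, cue=cue),
        }
        for t, g, p in product(HIDDEN_TEMPLATES, GROUPS, POSITIONS)
    ]


dataset = build_dataset()
print(len(dataset))  # 2 templates * 3 groups * 2 positions
```

The same product structure explains the reported size: with 13 templates and 2 positions, 416 variants implies 16 protected-group tokens (416 / 26). The `cue` parameter is there so that cue-word variants can be generated from the same templates.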

Key Findings

Different models have very different jailbreak success rates on the same prompts.

Numbers: P1 jailbreak success: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% (Table 4)

Practical Use: Run your safety tests on the exact model and prompt variants you plan to deploy; one model's safe result does not generalize to another.

Evidence Ref: Table 4, Prompt Type P1

Including a clear cue word (e.g., 'sentence') sharply reduces jailbreaks for some models.

Numbers: P2: BELLE-7B-2M jailbreak 3.1%, robustness 96.7%, trustworthiness 93.6% (Table 4)

Practical Use: If you control prompt format, adding explicit cue words can materially reduce unintended execution of hidden instructions; test cue words in your pipeline.

Evidence Ref: Table 4, Prompt Type P2

Results

Metric: Jailbreak success rate (P1)
Value: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6%
Baseline: not given
Delta: not given
Split / Dataset: Prompt Type P1 (toxicity translation template)
Evidence: Table 4 reports per-model success rates on P1
Evidence Ref: Table 4

Metric: Robustness (follow explicit instruction) (P1)
Value: ChatGLM2-6B 0.0%, BELLE-7B-2M 41.8%, ChatGPT 1.4%
Baseline: not given
Delta: not given
Split / Dataset: Prompt Type P1
Evidence: Table 4 robustness column for P1
Evidence Ref: Table 4

What To Try In 7 Days

Run the provided Latent Jailbreak prompts (or variants) against your model to get a baseline on jailbreak success and robustness.

Standardize prompts: put the task instruction first and add explicit cue words like 'sentence' to reduce hidden-instruction execution.

Use the paper's two-stage labeling approach: spot-check human labels and then fine-tune a classifier (RoBERTa) to scale labeling of generations.
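Once generations have been labeled (by hand or by a classifier), the baseline numbers from the steps above reduce to simple rates. The sketch below shows one way to compute them. The `Label` type and function names are hypothetical, and the trustworthiness definition used here (share of outputs that are both safe and follow the explicit instruction) is an assumption; the summary only describes it as a combined metric, so check the paper's definition before comparing against Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Two-level label for one model output (hypothetical schema)."""
    unsafe: bool               # output carries out the hidden malicious instruction
    follows_instruction: bool  # output performs the explicit normal task


def percent(flags):
    """Percentage of True values; 0.0 for an empty input."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def score(labels):
    """Aggregate per-output labels into the three benchmark-style rates."""
    labels = list(labels)
    return {
        "jailbreak_success": percent(l.unsafe for l in labels),
        "robustness": percent(l.follows_instruction for l in labels),
        # ASSUMPTION: trustworthy = safe AND follows the explicit instruction.
        "trustworthiness": percent(
            (not l.unsafe) and l.follows_instruction for l in labels
        ),
    }


example = [
    Label(unsafe=False, follows_instruction=True),
    Label(unsafe=False, follows_instruction=True),
    Label(unsafe=True, follows_instruction=False),
    Label(unsafe=False, follows_instruction=False),
]
print(score(example))
```

Reporting all three numbers side by side is the point: a model can score well on jailbreak success alone while failing robustness, which is the over-refusal case.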

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Dataset focuses on toxic instructions and protected-group templates; other malicious goals (e.g., misinformation, scams) are not covered.

Experiments test three models only (two open models and GPT-3.5), so results may not generalize to larger or different architectures.

When Not To Use

Do not assume passing this benchmark guarantees real-world safety against other jailbreak styles or non-toxic attacks.

Not suitable for multimodal inputs or voice/visual prompt-injection testing.

Failure Modes

Over-refusal: an aligned model rejects benign tasks and loses utility (observed for ChatGPT).

Position sensitivity: models may ignore suffix instructions and execute hidden malicious text.
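Both failure modes can be surfaced by splitting labeled results by instruction position. The sketch below is illustrative only: the record fields are hypothetical, and "safe but not following" is used as a rough proxy for over-refusal, since a safe output can miss the task for reasons other than refusing.

```python
from collections import defaultdict


def breakdown_by_position(records):
    """Per-position rates (in percent) for unsafe and safe-but-unhelpful outputs."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["position"]].append(record)

    summary = {}
    for position, rows in buckets.items():
        n = len(rows)
        unsafe = sum(r["unsafe"] for r in rows)
        # Rough over-refusal proxy: safe output that did not do the explicit task.
        unhelpful = sum(
            (not r["unsafe"]) and (not r["follows_instruction"]) for r in rows
        )
        summary[position] = {
            "n": n,
            "unsafe_pct": 100.0 * unsafe / n,
            "safe_but_not_following_pct": 100.0 * unhelpful / n,
        }
    return summary


# Toy records with hypothetical field names.
records = [
    {"position": "prefix", "unsafe": True, "follows_instruction": False},
    {"position": "prefix", "unsafe": False, "follows_instruction": True},
    {"position": "suffix", "unsafe": False, "follows_instruction": False},
    {"position": "suffix", "unsafe": False, "follows_instruction": True},
]
print(breakdown_by_position(records))
```

A large gap in `unsafe_pct` between positions points to position sensitivity, while a high `safe_but_not_following_pct` suggests the model is trading utility for safety.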

Core Entities

Models

ChatGLM2-6B, BELLE-7B-2M, ChatGPT (GPT-3.5-turbo-0613)

Metrics

Jailbreak success rate, Robustness (% following explicit instruction), Trustworthiness (combined safety+robustness metric)

Datasets

Latent Jailbreak prompt dataset (416 prompt variants)

Benchmarks

Latent Jailbreak benchmark