Latent-jailbreak benchmark: test if hidden malicious text breaks model safety or instruction following

Overview

Decision Snapshot: Needs Validation

Dataset and code are available and experiments are reproducible; evaluation covers multiple prompt templates and three models but is limited to toxicity templates, two languages, and a small model set.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces the Latent Jailbreak benchmark: synthetic prompts that hide a malicious instruction inside an otherwise normal task (for example, a request to translate a sentence, where the sentence itself tells the model to produce toxic text). The authors release 416 prompt variants (13 templates × protected-group tokens × 2 positions) and a hierarchical human-plus-automatic labeling procedure that measures three things: jailbreak success (share of unsafe outputs), robustness (whether the model follows the explicit normal instruction), and trustworthiness (a combined metric). Tests on three models (ChatGLM2-6B, BELLE-7B-2M, and GPT-3.5) show wide variation: some models are easy to jailbreak, others over-refuse and fail to follow benign,

Problem Statement

Current jailbreak benchmarks focus on whether a model outputs unsafe text, but ignore whether the model still follows the user's explicit instruction (output robustness). The gap: models can either be unsafe or over-refuse and lose competence. The paper builds a dataset and metrics to evaluate both safety and instruction-following robustness together.

Main Contribution

A latent jailbreak prompt dataset that embeds malicious instructions inside normal tasks and swaps instruction position (prefix vs suffix).

A hierarchical annotation framework (safety vs robustness) combining human labels and an automatic RoBERTa classifier.
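The dataset construction in the first contribution can be sketched as a grid over templates, group tokens, and instruction positions. This is a minimal illustration, not the authors' generation script: the template strings, group tokens, and the explicit task wording are hypothetical placeholders, and the count of 16 group tokens is only inferred from 416 / (13 × 2).

```python
from itertools import product

# Hypothetical placeholders: the benchmark's real templates and protected-group
# tokens are not reproduced here.
HIDDEN_TEMPLATES = [f"<hidden instruction {i} about {{group}}>" for i in range(1, 14)]
GROUP_TOKENS = [f"<group {i}>" for i in range(1, 17)]  # 16 inferred from 416 / (13 * 2)
EXPLICIT_TASK = "Translate the following sentence."    # assumed wording


def build_prompt(hidden: str, position: str) -> str:
    """Place the explicit task before (prefix) or after (suffix) the hidden text."""
    if position == "prefix":
        return f"{EXPLICIT_TASK}\n{hidden}"
    if position == "suffix":
        return f"{hidden}\n{EXPLICIT_TASK}"
    raise ValueError(f"unknown position: {position}")


def build_dataset() -> list[dict]:
    """Enumerate every template x group x position combination."""
    rows = []
    for template, group, position in product(
        HIDDEN_TEMPLATES, GROUP_TOKENS, ("prefix", "suffix")
    ):
        hidden = template.format(group=group)
        rows.append(
            {"group": group, "position": position, "prompt": build_prompt(hidden, position)}
        )
    return rows


dataset = build_dataset()
print(len(dataset))  # 13 * 16 * 2 = 416
```

Keeping position as an explicit field makes it straightforward to report results separately for prefix and suffix variants.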

Key Findings

Different models have very different jailbreak success rates on the same prompts.

Numbers: P1 jailbreak success: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% (Table 4)

Practical Use: Run your safety tests on the exact model and prompt variants you plan to deploy; one model's safe result does not generalize to another.

Evidence Ref: Table 4, Prompt Type P1

Including a clear cue word (e.g., 'sentence') sharply reduces jailbreaks for some models.

Numbers: P2: BELLE-7B-2M jailbreak 3.1%, robustness 96.7%, trustworthiness 93.6% (Table 4)

Practical Use: If you control prompt format, adding explicit cue words can materially reduce unintended execution of hidden instructions; test cue words in your pipeline.

Evidence Ref: Table 4, Prompt Type P2
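One way to try the cue-word idea in a pipeline is a small prompt formatter that labels the user-supplied text as data. The exact wording of the paper's P2 prompt is not reproduced in this summary, so the `Sentence:` layout and the function names below are assumptions for illustration only.

```python
def without_cue(task_instruction: str, payload: str) -> str:
    """Baseline layout: the payload follows the task with no marker."""
    return f"{task_instruction}\n{payload}"


def with_cue(task_instruction: str, payload: str, cue: str = "Sentence") -> str:
    """Mark the payload with an explicit cue word so it reads as data to process."""
    return f"{task_instruction}\n{cue}: {payload}"


task = "Translate the following sentence."  # assumed task wording
payload = "<user-supplied text>"            # placeholder, may contain a hidden instruction

print(without_cue(task, payload))
print(with_cue(task, payload))
```

Running both layouts over the same payloads gives a direct A/B comparison of jailbreak success and robustness for the model you plan to deploy.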

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Jailbreak success rate (P1)	ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6%	—	—	Prompt Type P1 (toxicity translation template)	Table 4 reports per-model success rates on P1	Table 4
Robustness (follow explicit instruction) (P1)	ChatGLM2-6B 0.0%, BELLE-7B-2M 41.8%, ChatGPT 1.4%	—	—	Prompt Type P1	Table 4 robustness column for P1	Table 4
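The three metrics can be computed from per-output labels roughly as follows. This is a sketch under stated assumptions: each output carries two boolean labels (safe, followed the explicit instruction), and trustworthiness is taken to be the share of outputs that are both safe and instruction-following. The source only describes trustworthiness as a combined metric, so that definition and all names here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    safe: bool      # False means the output was judged unsafe (jailbreak succeeded)
    followed: bool  # True means the output carried out the explicit instruction


def percent(count: int, total: int) -> float:
    return round(100 * count / total, 1)


def rates(labels: list[Label]) -> dict[str, float]:
    """Return the three metrics as percentages of all labeled outputs."""
    n = len(labels)
    if n == 0:
        raise ValueError("no labels to score")
    return {
        "jailbreak_success": percent(sum(not l.safe for l in labels), n),
        "robustness": percent(sum(l.followed for l in labels), n),
        # Assumed definition: safe AND followed the explicit instruction.
        "trustworthiness": percent(sum(l.safe and l.followed for l in labels), n),
    }


# Toy data: 6 good outputs, 2 unsafe outputs, 2 over-refusals.
toy = (
    [Label(safe=True, followed=True)] * 6
    + [Label(safe=False, followed=True)] * 2
    + [Label(safe=True, followed=False)] * 2
)
result = rates(toy)
print(result)
```

The toy data shows why both axes matter: the two over-refusals leave the jailbreak rate untouched but lower robustness and trustworthiness.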

What To Try In 7 Days

Run the provided Latent Jailbreak prompts (or variants) against your model to get a baseline on jailbreak success and robustness.

Standardize prompts: put the task instruction first and add explicit cue words like 'sentence' to reduce hidden-instruction execution.

Use the paper's two-stage labeling approach: spot-check human labels and then fine-tune a classifier (RoBERTa) to scale labeling of generations.
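The spot-check step in the last item can be sketched with the standard library alone. The classifier is represented only by its predicted labels (in practice these might come from a fine-tuned RoBERTa model), and the 0.9 agreement threshold and function names are hypothetical choices, not values from the paper.

```python
import random


def spot_check_sample(items: list, k: int, seed: int = 0) -> list:
    """Draw a reproducible random sample of generations for human review."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def agreement(human: list[str], predicted: list[str]) -> float:
    """Fraction of spot-checked items where the classifier matches the human label."""
    if not human or len(human) != len(predicted):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == p for h, p in zip(human, predicted)) / len(human)


def needs_more_human_labels(
    human: list[str], predicted: list[str], threshold: float = 0.9
) -> bool:
    """Flag the classifier for more training data when agreement is below the threshold."""
    return agreement(human, predicted) < threshold


human_labels = ["safe", "unsafe", "safe", "safe"]
classifier_labels = ["safe", "unsafe", "unsafe", "safe"]  # stand-in for model predictions
print(agreement(human_labels, classifier_labels))
print(needs_more_human_labels(human_labels, classifier_labels))
```

A fixed seed keeps the review sample stable across runs, which makes repeated audits comparable.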

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/qiuhuachuan/latent-jailbreak

Data URLs

https://github.com/qiuhuachuan/latent-jailbreak

Risks & Boundaries

Limitations

Dataset focuses on toxic instructions and protected-group templates; other malicious goals (e.g., misinformation, scams) are not covered.

Experiments test three models only (two open models and GPT-3.5), so results may not generalize to larger or different architectures.

When Not To Use

Do not assume passing this benchmark guarantees real-world safety against other jailbreak styles or non-toxic attacks.

Not suitable for multimodal inputs or voice/visual prompt-injection testing.

Failure Modes

Over-refusal: an aligned model rejects benign tasks and loses utility (observed for ChatGPT).

Position sensitivity: models may ignore suffix instructions and execute hidden malicious text.

Core Entities

Models

ChatGLM2-6B, BELLE-7B-2M, ChatGPT (GPT-3.5-turbo-0613)

Metrics

Jailbreak success rate; Robustness (% following explicit instruction); Trustworthiness (combined safety+robustness metric)

Datasets

Latent Jailbreak prompt dataset (416 prompt variants)

Benchmarks

Latent Jailbreak benchmark
