Overview
Dataset and code are available and experiments are reproducible; evaluation covers multiple prompt templates and three models but is limited to toxicity templates, two languages, and a small model set.
Citations: 11
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.
Who Should Care
Summary TLDR
The paper introduces the Latent Jailbreak benchmark: synthetic prompts that hide a malicious instruction inside an otherwise normal task (e.g., "translate this sentence", where the sentence itself tells the model to produce toxic text). The authors release 416 prompt variants (13 templates × protected-group tokens × 2 instruction positions) and a hierarchical human-plus-automatic labeling procedure that measures three things: jailbreak success (unsafe outputs), robustness (whether the model follows the explicit normal instruction), and trustworthiness (a combined metric). Tests on three models (ChatGLM2-6B, BELLE-7B-2M, GPT-3.5) show wide variation: some models are easy to jailbreak, others over-refuse and fail to follow benign,
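The variant count in the summary can be checked with a small enumeration. This is only a sketch: the template and group names are hypothetical placeholders rather than the released data, the instruction text is illustrative, and the figure of 16 group tokens is inferred from 416 / (13 × 2), not stated in the summary.

```python
from itertools import product

# Hypothetical placeholders; the real templates and tokens live in the released dataset.
NUM_TEMPLATES = 13
NUM_GROUP_TOKENS = 16  # assumption: inferred from 416 / (13 * 2)
POSITIONS = ("prefix", "suffix")  # where the explicit task instruction is placed

templates = [f"template_{i}" for i in range(NUM_TEMPLATES)]
group_tokens = [f"group_{j}" for j in range(NUM_GROUP_TOKENS)]


def build_variant(template_id, group_token, position):
    """Put the explicit instruction before (prefix) or after (suffix) the payload."""
    instruction = "Translate the following sentence."  # illustrative wording
    payload = f"<{template_id} filled with {group_token}>"
    if position == "prefix":
        text = f"{instruction}\n{payload}"
    else:
        text = f"{payload}\n{instruction}"
    return {
        "template": template_id,
        "group": group_token,
        "position": position,
        "prompt": text,
    }


variants = [
    build_variant(t, g, p) for t, g, p in product(templates, group_tokens, POSITIONS)
]
print(len(variants))  # 416
```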
Problem Statement
Current jailbreak benchmarks focus on whether a model outputs unsafe text, but ignore whether the model still follows the user's explicit instruction (output robustness). The gap: models can either be unsafe or over-refuse and lose competence. The paper builds a dataset and metrics to evaluate both safety and instruction-following robustness together.
Main Contribution
A latent jailbreak prompt dataset that embeds malicious instructions inside normal tasks and swaps instruction position (prefix vs suffix).
A hierarchical annotation framework (safety vs robustness) combining human labels and an automatic RoBERTa classifier.
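One plausible reading of the annotation framework is that every model output is asked two questions (is it unsafe, and did it follow the explicit instruction), with a human label taking precedence over the classifier when one exists. The sketch below assumes that reading. The function name `label_output` and the two toy classifiers are hypothetical stand-ins, not the paper's RoBERTa model.

```python
from typing import Callable, Optional


def label_output(
    output: str,
    is_unsafe: Callable[[str], bool],
    follows_instruction: Callable[[str], bool],
    human_label: Optional[dict] = None,
) -> dict:
    """Return safety and robustness labels for one model output.

    A human label, when provided, overrides the automatic classifiers.
    """
    if human_label is not None:
        return dict(human_label, source="human")
    return {
        "unsafe": is_unsafe(output),
        "followed": follows_instruction(output),
        "source": "auto",
    }


# Toy stand-ins for trained classifiers (assumptions for illustration only).
def toy_is_unsafe(text: str) -> bool:
    return "UNSAFE" in text


def toy_follows(text: str) -> bool:
    return text.startswith("TRANSLATION:")


auto = label_output("TRANSLATION: hello", toy_is_unsafe, toy_follows)
checked = label_output(
    "anything",
    toy_is_unsafe,
    toy_follows,
    human_label={"unsafe": True, "followed": False},
)
```

Keeping the classifiers as plain callables means a fine-tuned model can replace the toy functions without changing the labeling logic.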
Key Findings
Different models have very different jailbreak success rates on the same prompts.
Including a clear cue word (e.g., 'sentence') sharply reduces jailbreaks for some models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Jailbreak success rate (P1) | ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% | — | — | Prompt Type P1 (toxicity translation template) | Table 4 reports per-model success rates on P1 | Table 4 |
| Robustness (follow explicit instruction) (P1) | ChatGLM2-6B 0.0%, BELLE-7B-2M 41.8%, ChatGPT 1.4% | — | — | Prompt Type P1 | Table 4 robustness column for P1 | Table 4 |
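
Percentages like those in the table can be computed from per-output labels. The following is a minimal sketch that assumes each record carries boolean `unsafe` and `followed` fields (hypothetical field names). The records are toy data, not results from the paper, and the combined trustworthiness metric is left out because this summary does not define it.

```python
def rate(records, key):
    """Percentage of records whose boolean field `key` is true."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if r[key]) / len(records)


# Toy labeled outputs (illustrative only).
toy_records = [
    {"unsafe": True, "followed": False},
    {"unsafe": False, "followed": True},
    {"unsafe": False, "followed": True},
    {"unsafe": False, "followed": False},
]

jailbreak_success = rate(toy_records, "unsafe")  # 25.0
robustness = rate(toy_records, "followed")       # 50.0
```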
What To Try In 7 Days
Run the provided Latent Jailbreak prompts (or variants) against your model to get a baseline on jailbreak success and robustness.
Standardize prompts: put the task instruction first and add explicit cue words like 'sentence' to reduce hidden-instruction execution.
Use the paper's two-stage labeling approach: spot-check human labels and then fine-tune a classifier (RoBERTa) to scale labeling of generations.
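The prompt standardization step could look like the helper below. It is a sketch under assumptions: `standardize_prompt` is a hypothetical name, and the layout (instruction first, then the user text quoted and labeled with the cue word) is one way to apply the advice, not a format the paper prescribes.

```python
def standardize_prompt(instruction: str, payload: str, cue: str = "sentence") -> str:
    """Place the task instruction first and label the user text with a cue word."""
    label = cue.strip().capitalize()
    return f'{instruction.strip()}\n{label}: "{payload.strip()}"'


prompt = standardize_prompt(
    "Translate the following sentence into French.",
    "  The weather is nice today.  ",
)
print(prompt)
```

Quoting and labeling the payload marks it as data to be processed, which is the intuition behind the cue-word finding. Whether it helps for a given model still has to be measured.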
Risks & Boundaries
Limitations
Dataset focuses on toxic instructions and protected-group templates; other malicious goals (e.g., misinformation, scams) are not covered.
Experiments test three models only (two open models and GPT-3.5), so results may not generalize to larger or different architectures.
When Not To Use
Do not assume passing this benchmark guarantees real-world safety against other jailbreak styles or non-toxic attacks.
Not suitable for multimodal inputs or voice/visual prompt-injection testing.
Failure Modes
Over-refusal: an aligned model rejects benign tasks and loses utility (observed for ChatGPT).
Position sensitivity: models may ignore suffix instructions and execute hidden malicious text.
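Position sensitivity can be checked by splitting labeled outputs by instruction position and comparing the unsafe rate in each group. The sketch below assumes records with hypothetical `position` and `unsafe` fields and uses toy data only.

```python
from collections import defaultdict


def rate_by_position(records, key="unsafe"):
    """Percentage of records with a true `key`, grouped by instruction position."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["position"]] += 1
        hits[r["position"]] += int(bool(r[key]))
    return {pos: 100.0 * hits[pos] / totals[pos] for pos in totals}


# Toy labeled outputs (illustrative only).
toy_records = [
    {"position": "prefix", "unsafe": False},
    {"position": "prefix", "unsafe": True},
    {"position": "suffix", "unsafe": True},
    {"position": "suffix", "unsafe": True},
]

by_position = rate_by_position(toy_records)
gap = by_position["suffix"] - by_position["prefix"]
```

A large gap between the two groups suggests the model's behavior depends on where the explicit instruction appears.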

