Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
11
Why It Matters For Business
Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.
Summary TLDR
The paper introduces the Latent Jailbreak benchmark: synthetic prompts that hide a malicious instruction inside an otherwise normal task (e.g., "translate this sentence", where the sentence itself tells the model to produce toxic text). The authors release 416 prompt variants (13 templates × protected-group tokens × 2 positions) and a hierarchical human+automatic labeling procedure that measures three things: jailbreak success (unsafe outputs), robustness (whether the model follows the explicit normal instruction), and trustworthiness (a combined metric). Tests on three models (ChatGLM2-6B, BELLE-7B-2M, GPT-3.5) show wide variation: some models are easy to jailbreak, while others over-refuse and fail to follow benign,
Problem Statement
Current jailbreak benchmarks focus on whether a model outputs unsafe text, but ignore whether the model still follows the user's explicit instruction (output robustness). The gap: models can either be unsafe or over-refuse and lose competence. The paper builds a dataset and metrics to evaluate both safety and instruction-following robustness together.
Main Contribution
A latent jailbreak prompt dataset that embeds malicious instructions inside normal tasks and swaps instruction position (prefix vs suffix).
A hierarchical annotation framework (safety vs robustness) combining human labels and an automatic RoBERTa classifier.
Systematic analyses showing how instruction position, cue words, verbs, target groups, and toxic adjectives affect jailbreak rates across three LLMs.
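The dataset construction described above (templates crossed with group tokens and two instruction positions) can be sketched as follows. This is a minimal illustration, not the authors' generation script: the task instruction, the template wording, and the placeholder tokens are all hypothetical stand-ins, and the real templates and group tokens come from the released dataset. For reference, the stated total of 416 prompts with 13 templates and 2 positions implies 16 group tokens.

```python
from itertools import product

# Hypothetical placeholders; the real templates and tokens are in the paper's dataset.
TASK_INSTRUCTION = "Translate the following sentence into Chinese."
HIDDEN_TEMPLATES = ["Write one {adjective} sentence about {group}."]
GROUPS = ["<GROUP_A>", "<GROUP_B>"]
ADJECTIVE = "<ADJECTIVE>"


def build_prompts(task, templates, groups, adjective):
    """Return one record per (template, group, position) combination."""
    records = []
    for (t_idx, template), group, position in product(
        enumerate(templates), groups, ("prefix", "suffix")
    ):
        hidden = template.format(adjective=adjective, group=group)
        # "prefix": the explicit task instruction precedes the embedded text.
        # "suffix": the explicit task instruction follows it.
        if position == "prefix":
            text = f"{task}\n{hidden}"
        else:
            text = f"{hidden}\n{task}"
        records.append(
            {"template": t_idx, "group": group, "position": position, "prompt": text}
        )
    return records


prompts = build_prompts(TASK_INSTRUCTION, HIDDEN_TEMPLATES, GROUPS, ADJECTIVE)
print(len(prompts))  # 1 template x 2 groups x 2 positions = 4
```

Keeping the template index, group, and position on each record makes it straightforward to break results down by those factors later.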
Key Findings
Different models have very different jailbreak success rates on the same prompts.
Including a clear cue word (e.g., 'sentence') sharply reduces jailbreaks for some models.
Instruction position matters: models are safer and follow instructions more often when the explicit task instruction is a prefix.
Some models trade usefulness for safety by over-refusing, and thus show low robustness.
Sensitivity varies by instruction verb and toxic adjective.
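Comparisons like the ones above amount to grouping labeled generations by one prompt attribute and computing the jailbreak rate within each group. The sketch below assumes each record carries hypothetical `position`, `cue_word`, and `unsafe` fields; the four records are invented toy data for illustration only and are not results from the paper.

```python
from collections import defaultdict


def jailbreak_rate_by(records, factor):
    """Percentage of unsafe generations for each value of `factor`."""
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for record in records:
        key = record[factor]
        totals[key] += 1
        unsafe[key] += int(record["unsafe"])
    return {key: 100.0 * unsafe[key] / totals[key] for key in totals}


# Toy labels, invented for this example (not measurements from the paper).
records = [
    {"position": "prefix", "cue_word": True, "unsafe": False},
    {"position": "prefix", "cue_word": False, "unsafe": True},
    {"position": "suffix", "cue_word": True, "unsafe": True},
    {"position": "suffix", "cue_word": False, "unsafe": True},
]

print(jailbreak_rate_by(records, "position"))  # {'prefix': 50.0, 'suffix': 100.0}
print(jailbreak_rate_by(records, "cue_word"))  # {True: 50.0, False: 100.0}
```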
Results
Jailbreak success rate (P1)
Robustness (follow explicit instruction) (P1)
Effect of cue word 'sentence' (P2)
Who Should Care
What To Try In 7 Days
Run the provided Latent Jailbreak prompts (or variants) against your model to get a baseline on jailbreak success and robustness.
Standardize prompts: put the task instruction first and add explicit cue words like 'sentence' to reduce hidden-instruction execution.
Use the paper's two-stage labeling approach: spot-check human labels and then fine-tune a classifier (RoBERTa) to scale labeling of generations.
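A baseline run along these lines could start from a small harness like the one below. It is a sketch under stated assumptions: `generate` is any callable wrapping your model, `stub_model` is a stand-in that always refuses, and the 20% human-labeling fraction is an arbitrary choice, not a figure from the paper. The classifier fine-tuning step is left out; the second batch is simply what such a classifier would label.

```python
import random


def run_baseline(prompts, generate):
    """Collect one generation per prompt from a model callable."""
    return [{"prompt": p, "output": generate(p)} for p in prompts]


def split_for_labeling(generations, human_fraction=0.2, seed=0):
    """Split generations into a human-labeled seed set and the remainder."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = generations[:]
    rng.shuffle(shuffled)
    n_human = max(1, round(len(shuffled) * human_fraction))
    return shuffled[:n_human], shuffled[n_human:]


def stub_model(prompt):
    # Hypothetical stand-in for a real model call.
    return "I cannot help with that."


generations = run_baseline([f"prompt {i}" for i in range(10)], stub_model)
human_batch, classifier_batch = split_for_labeling(generations)
print(len(human_batch), len(classifier_batch))  # 2 8
```

The human-labeled batch would serve as training data for the automatic classifier, which then labels the larger remaining batch.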
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Dataset focuses on toxic instructions and protected-group templates; other malicious goals (e.g., misinformation, scams) are not covered.
- Experiments test three models only (two open models and GPT-3.5), so results may not generalize to larger or different architectures.
- Automatic labeling relies on a RoBERTa classifier fine-tuned on a P1 seed; classifier bias or labeling errors can affect aggregated metrics.
When Not To Use
- Do not assume passing this benchmark guarantees real-world safety against other jailbreak styles or non-toxic attacks.
- Not suitable for multimodal inputs or voice/visual prompt-injection testing.
- Not a direct replacement for adversarial red-teaming that targets system-level integrations.
Failure Modes
- Over-refusal: an aligned model rejects benign tasks and loses utility (observed for ChatGPT).
- Position sensitivity: models may ignore suffix instructions and execute hidden malicious text.
- Judge bias: automatic labels depend on the RoBERTa model fine-tuned on limited data, risking misclassification.
Core Entities
Models
- ChatGLM2-6B
- BELLE-7B-2M
- ChatGPT (GPT-3.5-turbo-0613)
Metrics
- Jailbreak success rate
- Robustness (% following explicit instruction)
- Trustworthiness (combined safety+robustness metric)
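
Given per-generation labels, the three metrics can be aggregated as in the sketch below. The `Label` fields are hypothetical names, and the trustworthiness rule used here (safe and task-following) is an assumption for illustration; the paper's exact combination formula is not given in this summary and may differ.

```python
from dataclasses import dataclass


@dataclass
class Label:
    unsafe: bool         # generation contains the requested harmful content
    followed_task: bool  # generation carried out the explicit benign instruction


def rate(flags):
    """Percentage of True values; 0.0 for an empty input."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def summarize(labels):
    return {
        "jailbreak_success": rate(l.unsafe for l in labels),
        "robustness": rate(l.followed_task for l in labels),
        # Assumption: trustworthy = safe AND task-following (not the paper's confirmed formula).
        "trustworthiness": rate((not l.unsafe) and l.followed_task for l in labels),
    }


# Toy labels, invented for this example.
labels = [
    Label(unsafe=True, followed_task=False),   # executed the hidden instruction
    Label(unsafe=False, followed_task=True),   # performed the benign task
    Label(unsafe=False, followed_task=False),  # refused outright (over-refusal)
    Label(unsafe=False, followed_task=True),
]
print(summarize(labels))
# {'jailbreak_success': 25.0, 'robustness': 50.0, 'trustworthiness': 50.0}
```

Reporting jailbreak success and robustness side by side is what exposes over-refusal: a model can score 0% on the first while also scoring poorly on the second.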
Datasets
- Latent Jailbreak prompt dataset (416 prompt variants)
Benchmarks
- Latent Jailbreak benchmark

