Overview
Production Readiness
1
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
Multimodal products inherit text-side jailbreak risks: hostile text prompts can bypass visual defenses and cause unsafe outputs, so safety pipelines must screen and harden text handling as well as images.
Summary TLDR
The authors release JailBreakV-28K, a 28,000-case test suite that probes whether jailbreak techniques that break text-only LLMs also break multimodal LLMs (MLLMs). They build RedTeam-2K (2,000 harmful queries), generate 20,000 text-transfer attacks and 8,000 image-based attacks, and test 10 open-source MLLMs. Key results, measured by attack success rate (ASR): text attacks derived from LLM jailbreaks succeed much more often than image-based attacks (average ASR ≈50.5% vs ≤30%), the LLMs serving as MLLM text encoders show very high ASR (≈68.7%), and the Malware and Economic Harm topics are the most vulnerable. The benchmark and dataset are available with controlled access for research.
Problem Statement
Do jailbreak strategies that succeed on text-only LLMs also work on multimodal LLMs? The paper asks whether MLLMs inherit LLM vulnerabilities, and quantifies how image input, attack method, and safety topic affect jailbreak success.
Main Contribution
RedTeam-2K: a curated set of 2,000 harmful queries spanning 16 safety policies.
JailBreakV-28K: 28,000 multimodal test cases (20k LLM-transfer text attacks + 8k image-based attacks).
Systematic evaluation of 10 open-source MLLMs showing high transferability of text jailbreaks.
Analysis showing that some topics (Malware, Economic Harm) are especially vulnerable and that image type has little effect on strong text-based attacks.
Key Findings
LLM-origin text jailbreaks transfer to MLLMs with high success
Overall benchmark shows substantial vulnerability
MLLMs inherit LLM encoder weaknesses
Text attacks outperform image attacks among current state-of-the-art methods
Some safety topics are far more attack-prone
Results
ASR of LLM-transfer attacks on MLLMs (average): ≈50.5%
ASR across entire benchmark (overall average)
ASR on MLLM text encoders: ≈68.7%
ASR for image-based MLLM attacks (best): ≤30%
ASR for Malware safety policy (average across models)
ASR for Economic Harm safety policy (average across models)
Who Should Care
What To Try In 7 Days
Run a focused audit: feed the highest-risk harmful query types from RedTeam-2K through your MLLM and measure ASR.
Add a text-side guard (e.g., toxicity and instruction filters) upstream of vision fusion.
Monitor and block queries targeting Malware and Economic Harm topics first.
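The audit step can be sketched as a small harness that computes ASR per safety topic and ranks topics by vulnerability. This is a minimal sketch under stated assumptions, not the paper's evaluation code: `run_model` and `judge_unsafe` are hypothetical callables standing in for your MLLM endpoint and a safety judge (the paper uses Llama Guard), and the `(policy, query)` record layout is an assumed way of storing RedTeam-2K-style queries.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def asr_by_policy(
    cases: Iterable[Tuple[str, str]],
    run_model: Callable[[str], str],           # hypothetical: query -> model response
    judge_unsafe: Callable[[str, str], bool],  # hypothetical: (query, response) -> unsafe?
) -> Dict[str, float]:
    """Return the attack success rate (fraction of unsafe responses) per policy."""
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for policy, query in cases:
        response = run_model(query)
        attempts[policy] += 1
        if judge_unsafe(query, response):
            successes[policy] += 1
    return {policy: successes[policy] / attempts[policy] for policy in attempts}


def rank_policies(asr: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort policies from most to least attack-prone."""
    return sorted(asr.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a real model or judge.
    demo_cases = [
        ("Malware", "q1"), ("Malware", "q2"),
        ("Economic Harm", "q3"), ("Economic Harm", "q4"),
    ]
    fake_model = lambda query: "complied" if query in {"q1", "q2", "q3"} else "refused"
    fake_judge = lambda query, response: response == "complied"
    for policy, rate in rank_policies(asr_by_policy(demo_cases, fake_model, fake_judge)):
        print(f"{policy}: {rate:.0%}")
```

The ranked output gives a simple way to decide which topics to monitor or block first. The toy data here is invented for illustration and says nothing about any real model.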
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses Llama Guard as an automated judge; judge errors can bias ASR.
- Benchmark focuses on open-source MLLMs; results may not generalize to closed commercial models.
- Some jailbreak methods (multilingual cognitive overload) were excluded, reducing language coverage.
- Access to the full benchmark requires permission; it is not fully public by default.
When Not To Use
- Do not expose benchmark prompts or test cases in public or production-facing settings where they could be misused.
- Not a replacement for human red-teaming; use as part of a broader safety program.
Failure Modes
- Llama Guard false negatives and false positives change the measured ASR.
- MLLMs with stronger proprietary safety layers may behave differently.
- Attack transferability depends on generation LLMs and tuning choices used to craft prompts.
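To see how judge errors shift a measured ASR, the sketch below applies a standard misclassification correction: if the judge flags a truly unsafe response with probability (1 - FNR) and a truly safe one with probability FPR, then observed ASR = FPR + true ASR × (1 - FNR - FPR). This is an illustration only. The paper does not report applying such a correction, the function name `corrected_asr` is hypothetical, and the error rates in the usage example are made-up numbers, not measured Llama Guard rates.

```python
def corrected_asr(
    observed_asr: float,
    false_positive_rate: float,
    false_negative_rate: float,
) -> float:
    """Estimate the true ASR from a judge-measured ASR and the judge's error rates.

    Inverts: observed = fpr + true * (sensitivity - fpr),
    where sensitivity = 1 - false_negative_rate.
    """
    sensitivity = 1.0 - false_negative_rate
    denominator = sensitivity - false_positive_rate
    if denominator <= 0:
        raise ValueError("Judge is no better than chance; correction is undefined.")
    estimate = (observed_asr - false_positive_rate) / denominator
    # Clamp to a valid rate, since sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical numbers: judge reports 50% ASR, with 5% FPR and 15% FNR.
    print(round(corrected_asr(0.50, 0.05, 0.15), 4))
```

With these illustrative inputs the judge under-reports the attack success rate, because missed unsafe responses outweigh falsely flagged safe ones.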
Core Entities
Models
- LLaVA-1.5-7B
- LLaVA-1.5-13B
- InstructBLIP-7B
- InstructBLIP-13B
- Qwen-VL-Chat
- LLaMA-Adapter-v2
- OmniLMM-12B
- InfiMM-Zephyr-7B
- InternLM-XComposer2-VL-7B
- Bunny-v1
- Llama-2
- Vicuna-7B
- Vicuna-13B
- Qwen1.5-7B
- phi-2
- Zephyr-7B
- Baichuan-7B
- ChatGLM3-6B
- Mixtral-8x7B
- InternLM2-7B
Metrics
- Attack Success Rate (ASR)
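As a reading aid, ASR can be written as the fraction of test cases whose response the judge labels unsafe. The notation is an assumption for illustration: D is the set of evaluated (query, response) pairs and J is the judge (Llama Guard in this benchmark).

```latex
\mathrm{ASR} = \frac{1}{|D|} \sum_{(q,\, r) \in D} \mathbb{1}\!\left[ J(q, r) = \text{unsafe} \right]
```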
Datasets
- JailBreakV-28K
- RedTeam-2K
- SafeBench
- MM-SafetyBench
- AdvBench
- BeaverTails
- hh-rlhf
- ImageNet
Benchmarks
- JailBreakV-28K
- MM-SafetyBench
- SafeBench
Context Entities
Models
- GPT-based generators (used to create queries)
- Llama Guard (evaluation judge)

