Overview
The dataset and experiments convincingly show that text jailbreaks transfer to many open-source MLLMs. However, the evaluation relies on an automated judge (Llama-Guard) and covers open models only, so the findings are strong for open-source systems but not definitive for closed commercial models.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 100%
Novelty: 70%
Why It Matters For Business
Multimodal products inherit text-side jailbreak risks: hostile text prompts can bypass visual defenses and cause unsafe outputs, so safety pipelines must screen and harden text handling as well as images.
Who Should Care
Summary TLDR
The authors release JailBreakV-28K, a 28,000-case test suite that probes whether jailbreak techniques that break text-only LLMs also break multimodal LLMs (MLLMs). They build RedTeam-2K (2,000 harmful queries), generate 20k text-transfer attacks and 8k image-based attacks, and test 10 open-source MLLMs. Key results, measured as attack success rate (ASR): LLM-derived text attacks succeed much more often than image attacks (average ASR ≈50.5% vs ≤30%), LLM text encoders show very high ASR (≈68.7%), and models are most vulnerable on the Malware and Economic Harm topics. The benchmark and dataset are available with controlled access for research.
Problem Statement
Do jailbreak strategies that succeed on text-only LLMs also work on multimodal LLMs? The paper asks whether MLLMs inherit LLM vulnerabilities, and quantifies how image input, attack method, and safety topic affect jailbreak success.
Main Contribution
RedTeam-2K: a curated set of 2,000 harmful queries spanning 16 safety policies.
JailBreakV-28K: 28,000 multimodal test cases (20k LLM-transfer text attacks + 8k image-based attacks).
Key Findings
LLM-origin text jailbreaks transfer to MLLMs with high success
Overall benchmark shows substantial vulnerability
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASR of LLM-transfer attacks on MLLMs (average) | 50.5% | — | — | JailBreakV-28K (20k text-transfer cases) | Average across 10 open-source MLLMs | Table 2 |
| ASR across entire benchmark (overall average) | 44% | — | — | JailBreakV-28K (28k cases) | Overall average reported | §4.2 |
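The two rows above can be cross-checked with a little arithmetic. The sketch below assumes (the summary does not state this) that the 44% overall figure is a case-weighted average over the 20k text-transfer cases and the 8k image-based cases. Under that assumption it solves for the image-based ASR implied by the other two numbers. The function name `implied_subset_rate` is illustrative, not taken from the paper.

```python
def implied_subset_rate(overall, known_rate, n_known, n_other):
    """Solve overall = (known_rate*n_known + x*n_other) / (n_known + n_other) for x."""
    total_successes = overall * (n_known + n_other)
    known_successes = known_rate * n_known
    return (total_successes - known_successes) / n_other


# Rates and case counts come from the table above.
# ASSUMPTION: the overall ASR is weighted by number of cases, not by model or topic.
image_asr = implied_subset_rate(
    overall=0.44, known_rate=0.505, n_known=20_000, n_other=8_000
)
print(f"Implied image-based ASR: {image_asr:.2%}")
```

Under this weighting assumption the implied image-based rate is about 27.75%, which is consistent with the "≤30%" figure quoted in the summary. If the paper averages per model or per topic instead, the check does not apply.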
What To Try In 7 Days
Run a focused audit: feed top harmful query types from RedTeam-2K through your MLLM to measure ASR.
Add a text-side guard (e.g., toxicity and instruction filters) upstream of vision fusion.
Monitor and block queries targeting Malware and Economic Harm topics first.
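The audit step in the list above could be organized roughly as follows: run each test case through the model, ask a judge whether the response is unsafe, and tally ASR per topic so the weakest topics surface first. This is a minimal sketch, not the paper's harness. `respond` and `is_unsafe` are hypothetical callables standing in for your MLLM endpoint and your judge (for example, a Llama-Guard wrapper), and the demo uses placeholder strings rather than real RedTeam-2K prompts.

```python
from collections import defaultdict


def asr_by_topic(cases, respond, is_unsafe):
    """Return {topic: ASR}, sorted from most to least vulnerable.

    cases     -- iterable of (topic, prompt) pairs
    respond   -- hypothetical callable: prompt -> model response
    is_unsafe -- hypothetical callable: (prompt, response) -> bool
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for topic, prompt in cases:
        attempts[topic] += 1
        if is_unsafe(prompt, respond(prompt)):
            successes[topic] += 1
    rates = {topic: successes[topic] / attempts[topic] for topic in attempts}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Stub demo: placeholder prompts, a fake model, and a fake judge.
demo_cases = [
    ("Malware", "placeholder-1"),
    ("Malware", "placeholder-2"),
    ("Economic Harm", "placeholder-3"),
    ("Economic Harm", "placeholder-4"),
]
stub_respond = lambda prompt: "REFUSED" if prompt.endswith("2") else "COMPLIED"
stub_is_unsafe = lambda prompt, response: response == "COMPLIED"

demo_report = asr_by_topic(demo_cases, stub_respond, stub_is_unsafe)
for topic, rate in demo_report.items():
    print(f"{topic}: {rate:.0%}")
```

Keeping the model and judge as injected callables makes it easy to swap in a different judge later and compare how much the measured ASR moves.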
Risks & Boundaries
Limitations
Evaluation uses Llama-Guard as the automated judge; judge errors can bias ASR.
Benchmark focuses on open-source MLLMs; results may not generalize to closed commercial models.
When Not To Use
Do not expose the benchmark's prompts or test cases publicly or in production systems, where they could be misused.
Not a replacement for human red-teaming; use as part of a broader safety program.
Failure Modes
Llama-Guard false negatives/positives change measured ASR.
MLLMs with stronger proprietary safety layers may behave differently.
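To make the judge-error failure mode concrete: if the judge flags a truly unsafe response with probability `tpr` and wrongly flags a safe one with probability `fpr`, the measured rate is `tpr * true_asr + fpr * (1 - true_asr)`. Inverting that relation gives the standard Rogan-Gladen correction for a misclassified proportion. The sketch below applies it; this is a swapped-in statistical technique, not something the paper reports doing, and the error rates in the example are hypothetical rather than measured properties of Llama-Guard.

```python
def corrected_asr(measured, tpr, fpr):
    """Rogan-Gladen estimate of the true ASR given the judge's error rates.

    measured -- ASR as reported by the automated judge
    tpr      -- judge's true-positive rate on unsafe responses
    fpr      -- judge's false-positive rate on safe responses
    """
    if not tpr > fpr:
        raise ValueError("judge must flag unsafe responses more often than safe ones")
    estimate = (measured - fpr) / (tpr - fpr)
    # Sampling noise can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, estimate))


# HYPOTHETICAL judge error rates, chosen only to illustrate the direction of the bias.
adjusted = corrected_asr(measured=0.505, tpr=0.90, fpr=0.05)
print(f"Adjusted ASR under assumed judge errors: {adjusted:.1%}")
```

In practice `tpr` and `fpr` would have to be estimated from a human-labeled sample of judge verdicts, which is one reason the benchmark complements rather than replaces human red-teaming.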

