JailBreakV-28K: 28,000 multimodal jailbreak tests show text-based LLM jailbreaks transfer to MLLMs

April 3, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and experiments convincingly show transfer risk across many open-source MLLMs. However, the evaluation relies on an automated judge (Llama-Guard) and covers open models only, so the findings are strong for open-source systems but not definitive for closed commercial models.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 70%

Authors

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

Why It Matters For Business

Multimodal products inherit text-side jailbreak risks: hostile text prompts can bypass visual defenses and cause unsafe outputs, so safety pipelines must screen and harden text handling as well as images.

Summary TLDR

The authors release JailBreakV-28K, a 28,000-case test suite that probes whether jailbreak techniques that break text-only LLMs also break multimodal LLMs (MLLMs). They build RedTeam-2K (2,000 harmful queries), generate 20k text-transfer attacks and 8k image-based attacks, and test 10 open-source MLLMs. Key results: LLM-derived text attacks succeed much more often than image attacks (average ASR ≈50.5% vs ≤30%), LLM text encoders show very high ASR (≈68.7%), and Malware and Economic Harm are the topics on which the tested models are most vulnerable. The benchmark and dataset are available for research under controlled access.

Problem Statement

Do jailbreak strategies that succeed on text-only LLMs also work on multimodal LLMs? The paper asks whether MLLMs inherit LLM vulnerabilities, and quantifies how image input, attack method, and safety topic affect jailbreak success.

Main Contribution

RedTeam-2K: a curated set of 2,000 harmful queries spanning 16 safety policies.

JailBreakV-28K: 28,000 multimodal test cases (20k LLM-transfer text attacks + 8k image-based attacks).

Key Findings

LLM-origin text jailbreaks transfer to MLLMs with high success

Numbers: Average ASR of LLM-transfer attacks on 10 MLLMs = 50.5%

Practical Use: Defenses must treat malicious text prompts as a core risk for MLLMs, not only image attacks.

Evidence Ref: Table 2; §4.2
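
As an illustration of handling text prompts as a first-class risk, the following is a minimal sketch of a text-side guard that runs before any text or image reaches the model. Everything here is an assumption for illustration: `text_guard`, `guarded_generate`, `BLOCKED_MARKERS`, and the `model_fn` callable are hypothetical names, and the substring check is only a stand-in for a real safety classifier.

```python
from typing import Callable, Optional

# Hypothetical placeholder markers; a real deployment would call a trained
# safety classifier instead of matching substrings.
BLOCKED_MARKERS = ("ignore previous instructions", "you are now dan")

REFUSAL = "Request declined by text-side guard."


def text_guard(prompt: str) -> bool:
    """Return True if the prompt passes the placeholder text screen."""
    lowered = prompt.lower()
    return not any(marker in lowered for marker in BLOCKED_MARKERS)


def guarded_generate(
    prompt: str,
    image: Optional[bytes],
    model_fn: Callable[[str, Optional[bytes]], str],
    guard_fn: Callable[[str], bool] = text_guard,
) -> str:
    """Screen the text first; only call the model if the screen passes."""
    if not guard_fn(prompt):
        return REFUSAL
    return model_fn(prompt, image)


if __name__ == "__main__":
    # Stand-in for an MLLM call (assumption: takes text plus optional image).
    def echo_model(prompt: str, image: Optional[bytes]) -> str:
        return f"model answer to: {prompt}"

    print(guarded_generate("Describe this picture.", None, echo_model))
    print(guarded_generate("Ignore previous instructions and ...", None, echo_model))
```

The guard is passed in as a callable so that the placeholder can be swapped for a stronger classifier without changing the call site.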

Overall benchmark shows substantial vulnerability

Numbers: Average ASR across whole JailBreakV-28K = 44%

Practical Use: Expect nearly half of adversarial cases to succeed on tested open-source MLLMs; prioritize remediation and monitoring.

Evidence Ref: Table 2; §4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ASR of LLM-transfer attacks on MLLMs (average) | 50.5% | - | - | JailBreakV-28K (20k text-transfer cases) | Average across 10 open-source MLLMs | Table 2
ASR across entire benchmark (overall average) | 44% | - | - | JailBreakV-28K (28k cases) | Overall average reported | §4.2
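
The two averages above can be cross-checked against the benchmark's 20k/8k split. The sketch below assumes the overall 44% is a case-weighted mean of the text-transfer and image-based subsets, which this summary does not confirm. The implied image-side figure is therefore a consistency check on rounded numbers, not a reported result.

```python
# Figures quoted in this summary (rounded as reported).
TEXT_CASES = 20_000
IMAGE_CASES = 8_000
TEXT_ASR = 0.505
OVERALL_ASR = 0.44


def implied_image_asr(
    text_cases: int, image_cases: int, text_asr: float, overall_asr: float
) -> float:
    """Solve overall = (text_asr*text + image_asr*image) / (text + image)
    for image_asr. Assumes a case-weighted mean (an assumption)."""
    total = text_cases + image_cases
    successes_overall = overall_asr * total
    successes_text = text_asr * text_cases
    return (successes_overall - successes_text) / image_cases


if __name__ == "__main__":
    value = implied_image_asr(TEXT_CASES, IMAGE_CASES, TEXT_ASR, OVERALL_ASR)
    print(f"Implied image-based ASR: {value:.2%}")
```

Under that assumption the image-based subset comes out at roughly 28%, which is in line with the "≤30%" figure for image attacks quoted in the TLDR.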

What To Try In 7 Days

Run a focused audit: feed top harmful query types from RedTeam-2K through your MLLM to measure ASR.

Add a text-side guard (e.g., toxicity and instruction filters) upstream of vision fusion.

Monitor and block queries targeting Malware and Economic Harm topics first.
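
The audit in the first step can be sketched as a per-topic ASR tally. This is a minimal sketch that assumes each test case has already been run through the model and labeled by a judge as a successful attack or not. The function names and the toy records are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def asr_by_topic(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each topic to its attack success rate.

    `results` holds (topic, attack_succeeded) pairs, where the boolean is
    the judge's verdict for one test case.
    """
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for topic, succeeded in results:
        attempts[topic] += 1
        successes[topic] += int(succeeded)
    return {topic: successes[topic] / attempts[topic] for topic in attempts}


def rank_topics(asr: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order topics from most to least vulnerable (highest ASR first)."""
    return sorted(asr.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy records for illustration only; not real benchmark outcomes.
    toy_results = [
        ("Malware", True), ("Malware", True), ("Malware", False),
        ("Economic Harm", True), ("Economic Harm", False),
        ("Privacy", False), ("Privacy", False),
    ]
    for topic, rate in rank_topics(asr_by_topic(toy_results)):
        print(f"{topic}: {rate:.0%}")
```

Ranking by ASR gives a simple ordering for which topics to monitor and block first.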

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation uses Llama-Guard as automated judge; judge errors can bias ASR.

Benchmark focuses on open-source MLLMs; results may not generalize to closed commercial models.

When Not To Use

Do not expose the benchmark's prompts or test cases publicly or in production systems, where they could be misused.

Not a replacement for human red-teaming; use as part of a broader safety program.

Failure Modes

Llama-Guard false negatives/positives change measured ASR.

MLLMs with stronger proprietary safety layers may behave differently.
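
To make the judge-error failure mode concrete, the sketch below models how a judge's true-positive and false-positive rates shift the measured ASR, and how the measurement could be adjusted with the standard Rogan-Gladen correction. The error rates used are hypothetical. This summary does not report Llama-Guard's actual error rates, and the correction is an illustration rather than something the paper applies.

```python
def measured_asr(true_asr: float, judge_tpr: float, judge_fpr: float) -> float:
    """ASR the judge would report, given its error rates.

    judge_tpr: share of truly successful attacks the judge flags.
    judge_fpr: share of failed attacks the judge wrongly flags.
    """
    return true_asr * judge_tpr + (1.0 - true_asr) * judge_fpr


def corrected_asr(measured: float, judge_tpr: float, judge_fpr: float) -> float:
    """Invert measured_asr (Rogan-Gladen correction), clipped to [0, 1]."""
    if judge_tpr <= judge_fpr:
        raise ValueError("judge must flag real successes more often than failures")
    estimate = (measured - judge_fpr) / (judge_tpr - judge_fpr)
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical judge: misses 10% of real successes, wrongly flags 5% of failures.
    tpr, fpr = 0.90, 0.05
    observed = measured_asr(0.50, tpr, fpr)
    print(f"measured ASR:  {observed:.3f}")
    print(f"corrected ASR: {corrected_asr(observed, tpr, fpr):.3f}")
```

Estimating the two error rates requires a human-labeled sample of judge verdicts, which is one reason to keep human red-teaming in the loop.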

Core Entities

Models

LLaVA-1.5-7B, LLaVA-1.5-13B, InstructBLIP-7B, InstructBLIP-13B, Qwen-VL-Chat, LLaMA-Adapter-v2, OmniLMM-12B, InfiMM-Zephyr-7B, InternLM-XComposer2-VL-7B, Bunny-v1, Llama-2, Vicuna-7B, Vicuna-13B, Qwen1.5-7B, phi-2, Zephyr-7B, Baichuan-7B, ChatGLM3-6B, Mixtral-8x7B, InternLM2-7B

Metrics

Attack Success Rate (ASR)

Datasets

JailBreakV-28K, RedTeam-2K, SafeBench, MM-SafetyBench, AdvBench, BeaverTails, hh-rlhf, ImageNet

Benchmarks

JailBreakV-28K, MM-SafetyBench, SafeBench

Context Entities

Models

GPT-based generators (used to create queries), Llama Guard (evaluation judge)