JailBreakV-28K: 28,000 multimodal jailbreak tests show text-based LLM jailbreaks transfer to MLLMs

Overview

Decision Snapshot: Ready For Pilot

The dataset and experiments convincingly show transfer risk on many open-source MLLMs. However, the evaluation relies on an automated judge (Llama-Guard) and covers open models only, so the findings are strong for open-source systems but not definitive for closed commercial models.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 70%

Authors

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

Links

Abstract / PDF / Data

Why It Matters For Business

Multimodal products inherit text-side jailbreak risks: hostile text prompts can bypass visual defenses and cause unsafe outputs, so safety pipelines must screen and harden text handling as well as images.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

The authors release JailBreakV-28K, a 28,000-case test suite that probes whether jailbreak techniques that break text-only LLMs also break multimodal LLMs (MLLMs). They build RedTeam-2K (2,000 harmful queries), generate 20,000 text-transfer attacks and 8,000 image-based attacks, and test 10 open-source MLLMs. Key results: LLM-derived text attacks succeed much more often than image attacks (average ASR ≈50.5% vs ≤30%), LLM text encoders show very high ASR (≈68.7%), and Malware and Economic Harm are the topics on which the tested models are weakest. The benchmark and dataset are available for research under controlled access.

Problem Statement

Do jailbreak strategies that succeed on text-only LLMs also work on multimodal LLMs? The paper asks whether MLLMs inherit LLM vulnerabilities, and quantifies how image input, attack method, and safety topic affect jailbreak success.

Main Contribution

RedTeam-2K: a curated set of 2,000 harmful queries spanning 16 safety policies.

JailBreakV-28K: 28,000 multimodal test cases (20k LLM-transfer text attacks + 8k image-based attacks).

Key Findings

LLM-origin text jailbreaks transfer to MLLMs with high success

Numbers: Average ASR of LLM-transfer attacks on 10 MLLMs = 50.5%

Practical Use: Defenses must treat malicious text prompts as a core risk for MLLMs, not only image attacks.

Evidence Ref: Table 2; §4.2

Overall benchmark shows substantial vulnerability

Numbers: Average ASR across the whole of JailBreakV-28K = 44%

Practical Use: Expect nearly half of adversarial cases to succeed on the tested open-source MLLMs; prioritize remediation and monitoring.

Evidence Ref: Table 2; §4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASR of LLM-transfer attacks on MLLMs (average)	50.5%	—	—	JailBreakV-28K (20k text-transfer cases)	Average across 10 open-source MLLMs	Table 2
ASR across entire benchmark (overall average)	44%	—	—	JailBreakV-28K (28k cases)	Overall average reported	§4.2
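As a rough consistency check on the two rows above, the following sketch solves for the image-attack ASR implied by the overall and text-transfer figures. It assumes the 44% overall figure is a case-weighted pool of the 20k text-transfer and 8k image-based cases; the summary does not state how the overall average was computed, so treat the result as a back-of-envelope estimate, not a reported number. The function name `implied_subset_rate` is made up for this illustration.

```python
def implied_subset_rate(overall_rate, n_total, known_rate, n_known):
    """Solve overall = (known_rate * n_known + x * n_rest) / n_total for x.

    Assumes the overall rate is a simple case-weighted pool of two subsets.
    """
    n_rest = n_total - n_known
    if n_rest <= 0:
        raise ValueError("the known subset must be smaller than the total")
    return (overall_rate * n_total - known_rate * n_known) / n_rest


# Figures from the Results table: 44% over 28k cases, 50.5% over 20k text cases.
implied_image_asr = implied_subset_rate(0.44, 28_000, 0.505, 20_000)
print(f"implied image-attack ASR: {implied_image_asr:.4f}")
```

Under that pooling assumption the implied image-attack rate comes out just under 30%, which is in line with the "≤30%" figure quoted in the summary.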

What To Try In 7 Days

Run a focused audit: feed top harmful query types from RedTeam-2K through your MLLM to measure ASR.

Add a text-side guard (e.g., toxicity and instruction filters) upstream of vision fusion.

Monitor and block queries targeting Malware and Economic Harm topics first.
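The three steps above can be sketched as a small audit harness: run topic-labeled queries through a model, score each response with a judge, and compare per-topic ASR with and without a text-side guard. Everything here is a placeholder under stated assumptions: `toy_model`, `toy_judge`, the guard rule, and the query strings are hypothetical stand-ins for your own MLLM call, your judge (the paper used Llama-Guard), and your RedTeam-2K sample.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

Case = Tuple[str, str]  # (safety topic, prompt text)


def audit_asr(cases: Iterable[Case],
              respond: Callable[[str], str],
              is_unsafe: Callable[[str, str], bool]) -> Dict[str, float]:
    """Return the per-topic fraction of cases whose response is judged unsafe."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for topic, prompt in cases:
        totals[topic] += 1
        if is_unsafe(prompt, respond(prompt)):
            hits[topic] += 1
    return {topic: hits[topic] / totals[topic] for topic in totals}


def with_text_guard(respond: Callable[[str], str],
                    should_block: Callable[[str], bool],
                    refusal: str = "I can't help with that.") -> Callable[[str], str]:
    """Wrap a model call so blocked prompts never reach the model."""
    def guarded(prompt: str) -> str:
        return refusal if should_block(prompt) else respond(prompt)
    return guarded


# Hypothetical stand-ins; replace with a real model call and a real judge.
def toy_model(prompt: str) -> str:
    return "I can't help with that." if prompt.endswith("B") else "Sure, here is how."


def toy_judge(prompt: str, response: str) -> bool:
    return response.startswith("Sure")


cases = [
    ("Malware", "placeholder query A"),
    ("Malware", "placeholder query B"),
    ("Economic Harm", "placeholder query C"),
    ("Economic Harm", "placeholder query D"),
]

rates = audit_asr(cases, toy_model, toy_judge)
guarded_model = with_text_guard(toy_model, lambda p: "query c" in p.lower())
guarded_rates = audit_asr(cases, guarded_model, toy_judge)
print(rates)
print(guarded_rates)
```

Keeping the guard as a wrapper around the model call means the same audit can be rerun unchanged after each mitigation, so the before and after ASR figures stay comparable.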

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k
https://eddyluo1232.github.io/JailBreakV28K/

Risks & Boundaries

Limitations

Evaluation uses Llama-Guard as automated judge; judge errors can bias ASR.

Benchmark focuses on open-source MLLMs; results may not generalize to closed commercial models.

When Not To Use

Do not expose the benchmark's prompts or test cases publicly or in production systems, where they could be misused.

Not a replacement for human red-teaming; use as part of a broader safety program.

Failure Modes

Llama-Guard false negatives/positives change measured ASR.

MLLMs with stronger proprietary safety layers may behave differently.
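To see how much the first failure mode could move a reported number, one option is a standard misclassification correction (the Rogan-Gladen estimator, which is not something the paper applies). The sketch below assumes you have estimated the judge's true-positive and false-positive rates on a human-labeled sample; the rates used in the example are invented for illustration and are not measurements of Llama-Guard.

```python
def corrected_asr(measured_asr: float, judge_tpr: float, judge_fpr: float) -> float:
    """Estimate the true ASR from a judge-measured ASR.

    Uses measured = tpr * true + fpr * (1 - true), solved for true,
    then clips the estimate to the valid range [0, 1].
    """
    denom = judge_tpr - judge_fpr
    if denom <= 0:
        raise ValueError("judge must flag unsafe outputs more often than safe ones")
    estimate = (measured_asr - judge_fpr) / denom
    return min(1.0, max(0.0, estimate))


# Hypothetical judge quality: catches 90% of unsafe outputs, wrongly flags 5% of safe ones.
adjusted = corrected_asr(measured_asr=0.44, judge_tpr=0.90, judge_fpr=0.05)
print(f"adjusted ASR estimate: {adjusted:.3f}")
```

With a perfect judge (`judge_tpr=1.0`, `judge_fpr=0.0`) the function returns the measured value unchanged, which is a quick sanity check on the formula.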

Core Entities

Models

LLaVA-1.5-7B, LLaVA-1.5-13B, InstructBLIP-7B, InstructBLIP-13B, Qwen-VL-Chat, LLaMA-Adapter-v2, OmniLMM-12B, InfiMM-Zephyr-7B, InternLM-XComposer2-VL-7B, Bunny-v1, Llama-2, Vicuna-7B, Vicuna-13B, Qwen1.5-7B, phi-2, Zephyr-7B, Baichuan-7B, ChatGLM3-6B, Mixtral-8x7B, InternLM2-7B

Metrics

Attack Success Rate (ASR)
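A minimal sketch of how ASR is typically computed and then averaged across models: the per-model ASR is the fraction of test cases judged successful, and the cross-model figure is the unweighted mean of those fractions. The model names and outcomes below are hypothetical, and the assumption that the reported averages are unweighted means over models is ours, not stated in this summary.

```python
from statistics import mean
from typing import Dict, List


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of test cases in which the attack was judged successful."""
    if not outcomes:
        raise ValueError("need at least one test case")
    return sum(outcomes) / len(outcomes)


def average_asr(outcomes_by_model: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-model ASR values."""
    return mean(attack_success_rate(o) for o in outcomes_by_model.values())


# Hypothetical judged outcomes for two models on the same four cases.
outcomes_by_model = {
    "model_a": [True, True, False, False],
    "model_b": [True, False, False, False],
}
per_model = {name: attack_success_rate(o) for name, o in outcomes_by_model.items()}
overall = average_asr(outcomes_by_model)
print(per_model, overall)
```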

Datasets

JailBreakV-28K, RedTeam-2K, SafeBench, MM-SafetyBench, AdvBench, BeaverTails, hh-rlhf, ImageNet

