JailBreakV-28K: 28,000 multimodal jailbreak tests show text-based LLM jailbreaks transfer to MLLMs

April 3, 2024 · 7 min

Overview

Production Readiness

1

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

5

Authors

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

Why It Matters For Business

Multimodal products inherit text-side jailbreak risks: hostile text prompts can bypass visual defenses and cause unsafe outputs, so safety pipelines must screen and harden text handling as well as images.

Summary TLDR

The authors release JailBreakV-28K, a 28,000-case test suite that probes whether jailbreak techniques that break text-only LLMs also break multimodal LLMs (MLLMs). They build RedTeam-2K (2,000 harmful queries), generate 20k text-transfer attacks and 8k image-based attacks, and test 10 open-source MLLMs. Key results: LLM-derived text attacks succeed much more often than image attacks (average ASR ≈50.5% vs. ≤30%), the LLMs used as MLLM text encoders show very high ASR (≈68.7%), and Malware and Economic Harm are the most vulnerable topics. The benchmark and dataset are available for research under controlled access.

Problem Statement

Do jailbreak strategies that succeed on text-only LLMs also work on multimodal LLMs? The paper asks whether MLLMs inherit LLM vulnerabilities, and quantifies how image input, attack method, and safety topic affect jailbreak success.

Main Contribution

RedTeam-2K: a curated set of 2,000 harmful queries spanning 16 safety policies.

JailBreakV-28K: 28,000 multimodal test cases (20k LLM-transfer text attacks + 8k image-based attacks).

Systematic evaluation of 10 open-source MLLMs showing high transferability of text jailbreaks.

Analysis showing that some topics (Malware, Economic Harm) are especially vulnerable, and that image type has little effect on strong text-based attacks.

Key Findings

LLM-origin text jailbreaks transfer to MLLMs with high success

Numbers: Average ASR of LLM-transfer attacks on 10 MLLMs = 50.5%

Overall benchmark shows substantial vulnerability

Numbers: Average ASR across the whole JailBreakV-28K = 44%

MLLMs inherit LLM encoder weaknesses

Numbers: Average ASR on MLLM text encoders = 68.7%

Text-based attacks currently outperform image-based attacks

Numbers: LLM-transfer text ASR = 50.5% vs. at most ≈30% for image-based attacks

Some safety topics are far more attack-prone

Numbers: Average ASR for Malware = 57.9%, Economic Harm = 53.1%

Results

ASR of LLM-transfer attacks on MLLMs (average)

Value: 50.5%

ASR across entire benchmark (overall average)

Value: 44%

ASR on MLLM text encoders

Value: 68.7%

ASR for image-based MLLM attacks (best)

Value: ≈30%

ASR for Malware safety policy (average across models)

Value: 57.9%

ASR for Economic Harm safety policy (average across models)

Value: 53.1%
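The headline figures can be cross-checked with a little arithmetic. The sketch below assumes that the 44% overall ASR is a case-weighted average over the 20k text-transfer and 8k image-based cases; the summary does not state how the averages were pooled, so treat the implied image-side figure as a rough consistency check, not a reported result. The function name `implied_image_asr` is hypothetical.

```python
# Figures taken from the summary above.
TEXT_CASES = 20_000    # LLM-transfer text attacks
IMAGE_CASES = 8_000    # image-based attacks
OVERALL_ASR = 0.44     # average ASR across the whole benchmark
TEXT_ASR = 0.505       # average ASR of LLM-transfer attacks


def implied_image_asr(overall: float, text_asr: float, n_text: int, n_image: int) -> float:
    """Solve overall * (n_text + n_image) = text_asr * n_text + image_asr * n_image.

    Assumption (not confirmed by the source): the overall ASR is weighted
    by the number of test cases in each group.
    """
    total_successes = overall * (n_text + n_image)
    text_successes = text_asr * n_text
    return (total_successes - text_successes) / n_image


if __name__ == "__main__":
    image_asr = implied_image_asr(OVERALL_ASR, TEXT_ASR, TEXT_CASES, IMAGE_CASES)
    print(f"Implied average image-attack ASR: {image_asr:.4f}")
```

Under that weighting assumption the implied image-side average is a little under 0.28, which sits below the ≈30% reported for the best image-based attack.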

What To Try In 7 Days

Run a focused audit: feed top harmful query types from RedTeam-2K through your MLLM to measure ASR.

Add a text-side guard (e.g., toxicity and instruction filters) upstream of vision fusion.

Monitor and block queries targeting Malware and Economic Harm topics first.
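The audit in the first step can be sketched as a small harness that reports ASR per topic. This is a minimal sketch, not the authors' evaluation code: `query_model` and `is_jailbroken` are hypothetical callables standing in for your MLLM endpoint and a safety judge (the paper uses Llama Guard), and the `(topic, prompt)` record layout is an assumption, not the RedTeam-2K schema.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def asr_by_topic(
    cases: Iterable[Tuple[str, str]],
    query_model: Callable[[str], str],
    is_jailbroken: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the attack success rate for each topic.

    cases         -- (topic, prompt) pairs; this layout is hypothetical.
    query_model   -- hypothetical wrapper that sends a prompt to the model under test.
    is_jailbroken -- hypothetical judge; True when the response counts as a
                     successful jailbreak.
    """
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for topic, prompt in cases:
        response = query_model(prompt)
        attempts[topic] += 1
        if is_jailbroken(prompt, response):
            successes[topic] += 1
    return {topic: successes[topic] / attempts[topic] for topic in attempts}


def _fake_model(prompt: str) -> str:
    # Toy stand-in: refuses one prompt, complies with the rest.
    return "REFUSED" if prompt.endswith("A") else "COMPLIED"


def _fake_judge(prompt: str, response: str) -> bool:
    # Toy stand-in for a safety judge.
    return response == "COMPLIED"


if __name__ == "__main__":
    demo_cases = [
        ("Malware", "placeholder prompt A"),
        ("Malware", "placeholder prompt B"),
        ("Economic Harm", "placeholder prompt C"),
    ]
    report = asr_by_topic(demo_cases, _fake_model, _fake_judge)
    # List the most attack-prone topics first.
    for topic, asr in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{topic}: {asr:.0%}")
```

Sorting the report by ASR gives a direct ordering for the third step: the topics at the top are the ones to monitor and block first.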

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses Llama Guard as the automated judge; judge errors can bias ASR.
  • Benchmark focuses on open-source MLLMs; results may not generalize to closed commercial models.
  • Some jailbreak methods (multilingual cognitive overload) were excluded, reducing language coverage.
  • Access to full benchmark requires permission; not fully public by default.

When Not To Use

  • Do not expose the benchmark's prompts or test cases publicly or in production systems, where they could be misused.
  • Not a replacement for human red-teaming; use as part of a broader safety program.

Failure Modes

  • Llama Guard false negatives and false positives change the measured ASR.
  • MLLMs with stronger proprietary safety layers may behave differently.
  • Attack transferability depends on the generator LLMs and tuning choices used to craft the prompts.
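To see how much judge error could move a measured ASR, one option is the standard Rogan-Gladen correction for a misclassifying test. This is a sketch under stated assumptions, not something the paper does: the judge's sensitivity and specificity below are hypothetical placeholders, and in practice they would have to be estimated from a human-labelled sample.

```python
def corrected_asr(measured_asr: float, sensitivity: float, specificity: float) -> float:
    """Estimate the true ASR from a measured ASR and the judge's error rates.

    Model: measured = sensitivity * true + (1 - specificity) * (1 - true),
    solved for `true` (Rogan-Gladen estimator) and clamped to [0, 1].

    sensitivity -- share of real jailbreaks the judge flags (hypothetical input).
    specificity -- share of safe responses the judge passes (hypothetical input).
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("judge must perform better than chance")
    estimate = (measured_asr + specificity - 1.0) / denominator
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # 0.505 is the reported LLM-transfer ASR; the judge error rates are made up.
    for sens, spec in [(1.0, 1.0), (0.90, 0.95), (0.80, 0.98)]:
        print(f"sens={sens:.2f} spec={spec:.2f} -> {corrected_asr(0.505, sens, spec):.3f}")
```

A perfect judge leaves the measured value unchanged; any other setting shows the direction and rough size of the bias that judge errors would introduce.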

Core Entities

Models

  • LLaVA-1.5-7B
  • LLaVA-1.5-13B
  • InstructBLIP-7B
  • InstructBLIP-13B
  • Qwen-VL-Chat
  • LLaMA-Adapter-v2
  • OmniLMM-12B
  • InfiMM-Zephyr-7B
  • InternLM-XComposer2-VL-7B
  • Bunny-v1
  • Llama-2
  • Vicuna-7B
  • Vicuna-13B
  • Qwen1.5-7B
  • phi-2
  • Zephyr-7B
  • Baichuan-7B
  • ChatGLM3-6B
  • Mixtral-8x7B
  • InternLM2-7B

Metrics

  • Attack Success Rate (ASR)
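
For reference, ASR is conventionally the fraction of attack attempts that the judge marks as successful. The notation below is an assumption for illustration: D is the set of test cases, M(x) is the model's response to case x, and J(x, M(x)) is the judge's verdict, equal to 1 when the response is deemed harmful and 0 otherwise.

```latex
\mathrm{ASR}(M) = \frac{1}{|D|} \sum_{x \in D} \mathbb{1}\bigl[\, J\bigl(x, M(x)\bigr) = 1 \,\bigr]
```

With this definition, an ASR of 50.5% means the judge flagged roughly half of the responses as successful jailbreaks.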

Datasets

  • JailBreakV-28K
  • RedTeam-2K
  • SafeBench
  • MM-SafetyBench
  • AdvBench
  • BeaverTails
  • hh-rlhf
  • ImageNet

Benchmarks

  • JailBreakV-28K
  • MM-SafetyBench
  • SafeBench

Context Entities

Models

  • GPT-based generators (used to create queries)
  • Llama Guard (evaluation judge)