Benchmark: Vision LLMs handle odd images but break on counterfactual text and simple ViT attacks

November 27, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

4

Authors

Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie

Links

Abstract / PDF

Why It Matters For Business

If you deploy image+text models, simple visual attacks and text changes can break behavior; test both inputs and add safety-aware visual instruction tuning before release.

Summary TLDR

The authors build a safety benchmark for Vision LLMs (VLLMs) covering out-of-distribution (OOD) VQA and red-teaming attacks. They release two OOD datasets (OODCV-VQA and Sketchy-VQA, each with a harder variant) and adversarial test sets for vision and language attacks. Key takeaways: VLLMs often answer yes/no questions about unusual images well but fail when the question text is counterfactual; simple CLIP-based image perturbations can mislead many VLLMs; sketch images are hard; vision-only jailbreaks transfer poorly across models; and vision-language tuning can weaken LLM safety. The authors evaluate 21 models, including GPT-4V and open-source VLLMs, and release code and data.

Problem Statement

VLLMs are being rapidly deployed for image+text tasks, but we lack a focused safety benchmark that tests both out-of-distribution visual cases and adversarial/jailbreak attacks on visual and language inputs. The paper fills this gap with new OOD datasets and red-teaming attacks that measure safety weaknesses in current VLLMs.

Main Contribution

A safety benchmark with two OOD VQA datasets (OODCV-VQA, Sketchy-VQA) and harder counterfactual / rare-object variants.

Two straightforward CLIP-ViT based adversarial attacks (SIN.ATTACK and MIX.ATTACK) and transfer/jailbreak evaluations.

Large-scale evaluation of 21 models (open VLLMs and GPT-4V) with quantitative results on OOD generalization, sketch recognition, adversarial misleading rate, and jailbreak transferability.

Public release of datasets and code at the project GitHub for reproducible safety testing.
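The summary does not spell out how SIN.ATTACK and MIX.ATTACK are constructed, so the following is only a generic sketch of the family they belong to: a projected sign-gradient (PGD-style) perturbation bounded by ϵ=64/255 that pushes an image embedding toward a misleading text embedding. The linear "encoder" `W`, the target embedding `t`, the step size, and the step count are all hypothetical stand-ins for a real CLIP ViT and its text features.

```python
# Toy L-infinity PGD sketch. W and t are hypothetical stand-ins for a CLIP ViT
# image encoder and a misleading text embedding; this is NOT the paper's attack code.

def sign(v):
    return (v > 0) - (v < 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

W = [[0.2, -0.5, 0.1, 0.4],
     [-0.3, 0.6, 0.2, -0.1]]   # toy encoder: 4 "pixels" -> 2-dim embedding
t = [1.0, -1.0]                # toy target (misleading) text embedding

def score(x):
    """Similarity between the encoded image and the target text embedding."""
    return dot([dot(row, x) for row in W], t)

def grad(x):
    """Gradient of score w.r.t. x; for a linear encoder this is W^T t."""
    return [sum(W[r][c] * t[r] for r in range(len(W))) for c in range(len(x))]

def linf_pgd(x, eps=64 / 255, alpha=8 / 255, steps=10):
    adv = list(x)
    for _ in range(steps):
        g = grad(adv)
        # Ascent step in the direction of the gradient sign.
        adv = [a + alpha * sign(gi) for a, gi in zip(adv, g)]
        # Project back into the eps-ball around the clean image.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
        # Keep pixel values in the valid [0, 1] range.
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv

clean = [0.5, 0.5, 0.9, 0.1]
adv = linf_pgd(clean)
print("clean score:", score(clean), "adversarial score:", score(adv))
```

In a real test the gradient would come from backpropagating through the vision encoder rather than from a closed-form expression, but the step, projection, and clamping structure stays the same.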

Key Findings

VLLMs answer OOD visual yes/no questions very well but fail when text is counterfactual.

Numbers: Yes/No accuracy >=95% on OOD images; counterfactual overall drop 17.1%, Yes/No drop 33.2% (Table 5)

Sketch images with minimal detail cause consistent recognition failures.

Numbers: Best F1 <70% on sketch task; rare-category F1 drops ~4.4% (Sec. 4.1.2)

Simple CLIP-ViT attacks can mislead many VLLMs; GPT-4V often rejects instead of hallucinating.

Numbers: MIX.ATTACK improved misleading rate by ~8.4% and SIN.ATTACK by ~5.0% over ATTACKBARD at ϵ=64/255; GPT-4V shows a higher rejection rate

Vision-only jailbreaking has limited universal transfer but can succeed on targeted models.

Numbers: Targeted attacks produced ~2.1× more toxic outputs on specific targets but only ~5% average toxic increase across models

Vision-language fine-tuning tends to weaken some LLM safety behaviors.

Numbers: VLLMs show +5.5% (vanilla) and +17.3% (white-box) higher attack success than base LLMs; LLM→VLLM transfer increases ASR

Results

Accuracy

Value: GPT-4V 80.61%

Baseline: many open VLLMs 50–76% (varies by model)

Accuracy

Value: Most VLLMs >=95% on Yes/No

Counterfactual text impact

Value: Average overall score drop 17.1%; Yes/No drop 33.2%

Baseline: Original OODCV-VQA

Misleading/missing rate (vision attack)

Value: MIX.ATTACK improves misleading rate by ~8.4% vs ATTACKBARD at ϵ=64/255

Baseline: ATTACKBARD at same setting

Vision-language tuning effect on jailbreak ASR

Value: VLLMs white-box ASR 94.8% vs LLMs 77.5%

Baseline: LLM
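A quick arithmetic check on the jailbreak figures: the white-box ASR values reported here (94.8% vs 77.5%) reproduce the +17.3% in Key Findings only when the gap is read as percentage points, not as a relative increase. The variable names below are illustrative.

```python
# White-box jailbreak ASR values as reported in Results.
vllm_asr = 94.8  # VLLMs, percent
llm_asr = 77.5   # base LLMs, percent

# Absolute gap in percentage points.
point_gap = round(vllm_asr - llm_asr, 1)

# Relative increase over the LLM baseline, in percent.
relative_increase = round(100 * (vllm_asr - llm_asr) / llm_asr, 1)

print(f"gap: {point_gap} points, relative increase: {relative_increase}%")
# gap: 17.3 points, relative increase: 22.3%
```

Stating which of the two readings is meant avoids understating the relative change when comparing against other benchmarks.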

Who Should Care

What To Try In 7 Days

Run OODCV-VQA and Sketchy-VQA on your VLLM to spot weaknesses in counting, sketches, and counterfactual text.

Apply CLIP-based SIN/MIX attacks to a small image sample to check if your system hallucinates or rejects.

Re-evaluate safety rules after any vision-language fine-tuning and add counterfactual/text-robust examples to safety data.
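A minimal sketch of the first check, assuming you can wrap your VLLM as a function `model(image, question) -> str`. The always-yes `toy_model`, the example records, and the question wording are hypothetical placeholders, not items from OODCV-VQA; the point is only to show how a yes/no accuracy drop between original and counterfactual questions could be measured.

```python
def normalize(answer):
    """Map a free-form reply to 'yes', 'no', or None if it starts with neither."""
    words = answer.strip().lower().split()
    if not words:
        return None
    token = words[0].strip(".,!:;")
    return token if token in ("yes", "no") else None

def accuracy(model, examples):
    """Fraction of examples whose normalized reply matches the gold answer."""
    correct = sum(
        normalize(model(ex["image"], ex["question"])) == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)

# Hypothetical stand-in for a model that ignores the question text.
def toy_model(image, question):
    return "Yes, that appears to be the case."

original = [
    {"image": "img1.jpg", "question": "Is there a car in the image?", "answer": "yes"},
    {"image": "img2.jpg", "question": "Is there a boat in the image?", "answer": "yes"},
    {"image": "img3.jpg", "question": "Is there a bus in the image?", "answer": "yes"},
    {"image": "img4.jpg", "question": "Is there a train in the image?", "answer": "no"},
]
counterfactual = [
    {"image": "img1.jpg", "question": "Would there be a car if the car were removed?", "answer": "no"},
    {"image": "img2.jpg", "question": "Would there be a boat if the boat were removed?", "answer": "no"},
    {"image": "img3.jpg", "question": "Would there be a bus if the bus were removed?", "answer": "no"},
    {"image": "img4.jpg", "question": "Would there be a train if a train were added?", "answer": "yes"},
]

orig_acc = accuracy(toy_model, original)
cf_acc = accuracy(toy_model, counterfactual)
drop = orig_acc - cf_acc
print(f"original={orig_acc:.2f} counterfactual={cf_acc:.2f} drop={drop:.2f}")
```

Replies that start with neither "yes" nor "no" are counted as wrong here; you may prefer to track them separately, since refusals and hedged answers are a distinct failure mode.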

Agent Features

Architectures

  • Vision-Language Models
  • CLIP-based ViT connectors

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • GPT-4V was evaluated only on selected challenging subsets, not full benchmark.
  • Counterfactual questions are template-generated, which may differ from human-written counterfactuals.
  • Vision attacks were tuned on CLIP ViT-L; transferability to other vision backbones can vary.
  • Toxicity judgments rely on Perspective API and GPT-3.5 classifiers, which may introduce judge bias.

When Not To Use

  • Not a coverage test for generative image synthesis or creative multimodal tasks.
  • Not a definitive proof of robustness; it is diagnostic only for the tested attack families and OOD types.
  • Not focused on long-horizon agent behavior or multi-step tool use safety.

Failure Modes

  • Models may refuse to answer (rejection) which skews 'misleading' vs 'rejection' metrics.
  • Template-based counterfactuals can produce artificial failure modes not seen in real user input.
  • Adversarial images tuned on one encoder may overestimate risk for systems using different vision backbones.
  • Toxicity detection errors (false positives/negatives) from automated classifiers.

Core Entities

Models

  • GPT-4V
  • InstructBLIP
  • LLaVA
  • MiniGPT4
  • Qwen-VL-Chat
  • CogVLM
  • InternLM-X
  • PandaGPT
  • Fuyu
  • Vicuna (various)
  • LLaMA-Adapter
  • mPLUG-Owl

Metrics

  • Accuracy
  • F1 (sketch recognition)
  • Misleading / missing rate
  • Attack Success Rate (ASR)
  • Rejection rate
  • Toxicity score (Perspective API)
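
Because refusals can skew the misleading-versus-rejection split, it helps to tally them as separate outcomes. The sketch below assumes a crude keyword-based classifier; the refusal markers, the outcome labels, and the sample responses are hypothetical, and the paper's exact metric definitions are not reproduced here.

```python
from collections import Counter

# Hypothetical refusal phrases; a real system would need a more careful check.
REFUSAL_MARKERS = ("i cannot", "i can't", "sorry", "unable to")

def classify(response, true_label, target_label):
    """Label one response to an attacked image as rejected/misled/correct/other."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "rejected"
    if target_label in text and true_label not in text:
        return "misled"
    if true_label in text:
        return "correct"
    return "other"

def outcome_rates(records):
    """Return the share of each outcome over (response, true, target) records."""
    counts = Counter(classify(*record) for record in records)
    total = len(records)
    return {k: counts[k] / total for k in ("correct", "misled", "rejected", "other")}

# Toy responses to images of a cat that were perturbed toward "dog".
records = [
    ("This is a dog.", "cat", "dog"),
    ("Sorry, I cannot describe this image.", "cat", "dog"),
    ("A cat sitting on a sofa.", "cat", "dog"),
    ("I see a dog.", "cat", "dog"),
]
rates = outcome_rates(records)
print(rates)
```

Substring matching is deliberately simplistic; it will mislabel responses such as "not a dog", so treat the rates as a rough triage signal.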

Datasets

  • OODCV-VQA
  • OODCV-Counterfactual
  • Sketchy-VQA
  • Sketchy-Challenging
  • NIPS17 200-image set (misleading attacks)

Benchmarks

  • VLLM safety benchmark (OOD + red-teaming)
  • Misleading-rate benchmark
  • Jailbreak ASR benchmark