Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
4
Why It Matters For Business
If you deploy image+text models, simple visual attacks and changes to the text prompt can break model behavior. Test both input channels and add safety-aware visual instruction tuning before release.
Summary TLDR
The authors build a safety benchmark for Vision LLMs (VLLMs) covering out-of-distribution (OOD) VQA and red-teaming attacks. They release two OOD datasets (OODCV-VQA and Sketchy-VQA, each with a harder variant) and adversarial test sets for vision and language attacks. Key takeaways: VLLMs answer yes/no questions about unusual images well but fail when the question text is counterfactual; simple CLIP-based image perturbations can mislead many VLLMs; sketch images are hard to recognize; vision-only jailbreaks transfer poorly across models; and vision-language tuning can weaken LLM safety. The authors evaluate 21 models, including GPT-4V and open-source VLLMs, and release code and data.
Problem Statement
VLLMs are being rapidly deployed for image+text tasks, but there is no focused safety benchmark that tests both out-of-distribution visual cases and adversarial/jailbreak attacks on visual and language inputs. The paper fills this gap with new OOD datasets and red-teaming attacks that measure safety weaknesses in current VLLMs.
Main Contribution
A safety benchmark with two OOD VQA datasets (OODCV-VQA, Sketchy-VQA) and harder counterfactual / rare-object variants.
Two straightforward CLIP-ViT based adversarial attacks (SIN.ATTACK and MIX.ATTACK) and transfer/jailbreak evaluations.
Large-scale evaluation of 21 models (open VLLMs and GPT-4V) with quantitative results on OOD generalization, sketch recognition, adversarial misleading rate, and jailbreak transferability.
Public release of datasets and code at the project GitHub for reproducible safety testing.
Key Findings
VLLMs answer OOD visual yes/no questions very well but fail when text is counterfactual.
Sketch images with minimal detail cause consistent recognition failures.
Simple CLIP-ViT attacks can mislead many VLLMs; GPT-4V often rejects instead of hallucinating.
Vision-only jailbreaking has limited universal transfer but can succeed on targeted models.
Vision-language fine-tuning tends to weaken some LLM safety behaviors.
Results
Accuracy
Counterfactual text impact
Misleading/missing rate (vision attack)
Vision-language tuning effect on jailbreak ASR
Who Should Care
What To Try In 7 Days
Run OODCV-VQA and Sketchy-VQA on your VLLM to spot weaknesses in counting, sketches, and counterfactual text.
Apply CLIP-based SIN/MIX attacks to a small image sample to check if your system hallucinates or rejects.
Re-evaluate safety rules after any vision-language fine-tuning and add counterfactual/text-robust examples to safety data.
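The first step above can be sketched as a small scoring loop that reports accuracy separately for standard and counterfactual questions, so a gap between the two is visible at a glance. This is a minimal sketch, not the authors' evaluation code: the record fields (`subset`, `image`, `question`, `answer`), the `ask_model` callable, the example questions, and the first-word answer parsing are all assumptions made for illustration, and only yes/no questions are covered.

```python
from collections import defaultdict


def normalize_yes_no(text):
    """Map a free-form reply to 'yes', 'no', or 'other' using its first word."""
    words = text.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    return first if first in ("yes", "no") else "other"


def accuracy_by_subset(examples, ask_model):
    """Return {subset name: accuracy} for yes/no VQA records.

    `ask_model(image, question)` is a hypothetical callable that wraps
    whatever VLLM is under test and returns its text reply.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = normalize_yes_no(ask_model(ex["image"], ex["question"]))
        total[ex["subset"]] += 1
        correct[ex["subset"]] += int(pred == ex["answer"])
    return {name: correct[name] / total[name] for name in total}


# Hypothetical records; real field names depend on the released data format.
examples = [
    {"subset": "standard", "image": "img_001.jpg",
     "question": "Is there a bicycle in the image?", "answer": "yes"},
    {"subset": "counterfactual", "image": "img_001.jpg",
     "question": "Would there be a bicycle in the image if it were removed?",
     "answer": "no"},
]


def always_yes(image, question):
    """Stand-in model that ignores its inputs, used only to exercise the loop."""
    return "Yes."


print(accuracy_by_subset(examples, always_yes))
```

A model that answers "yes" regardless of the question scores perfectly on the standard record and fails the counterfactual one, which is the kind of gap the benchmark is designed to expose.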
Agent Features
Architectures
- Vision-Language Models
- CLIP-based ViT connectors
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- GPT-4V was evaluated only on selected challenging subsets, not the full benchmark.
- Counterfactual questions are template-generated, which may differ from human-written counterfactuals.
- Vision attacks were tuned on CLIP ViT-L; transferability to other vision backbones can vary.
- Toxicity judgments rely on Perspective API and GPT-3.5 classifiers, which can introduce judge bias.
When Not To Use
- Not a coverage test for generative image synthesis or creative multimodal tasks.
- Not a definitive proof of robustness; it is only diagnostic for the tested attack families and OOD types.
- Not focused on long-horizon agent behavior or multi-step tool use safety.
Failure Modes
- Models may refuse to answer (rejection), which skews the 'misleading' vs 'rejection' metrics.
- Template-based counterfactuals can produce artificial failure modes not seen in real user input.
- Adversarial images tuned on one encoder may overestimate risk for systems using different vision backbones.
- Automated classifiers can make toxicity detection errors (false positives/negatives).
Core Entities
Models
- GPT-4V
- InstructBLIP
- LLaVA
- MiniGPT4
- Qwen-VL-Chat
- CogVLM
- InternLM-X
- PandaGPT
- Fuyu
- Vicuna (various)
- LLaMA-Adapter
- mPLUG-Owl
Metrics
- Accuracy
- F1 (sketch recognition)
- Misleading / missing rate
- Attack Success Rate (ASR)
- Rejection rate
- Toxicity score (Perspective API)
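To make the misleading and rejection rates concrete, the sketch below sorts each response to an adversarial image into one of four outcomes and reports the share of each. It is an illustration under stated assumptions, not the paper's scoring procedure: the keyword list for detecting refusals, the substring matching on labels, the record fields, and the rule that a response naming both labels counts as misled are all hypothetical choices.

```python
from collections import Counter

# Assumed refusal phrases; a real evaluation would need a more careful judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "unable to")


def classify_response(response, true_label, target_label):
    """Label a response as 'rejected', 'misled', 'correct', or 'other'."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "rejected"
    if target_label.lower() in text:
        return "misled"   # the model named the attacker's target label
    if true_label.lower() in text:
        return "correct"
    return "other"


def outcome_rates(records):
    """Return the fraction of records that fall into each outcome."""
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(
        classify_response(r["response"], r["true_label"], r["target_label"])
        for r in records
    )
    n = len(records)
    return {k: counts[k] / n for k in ("misled", "rejected", "correct", "other")}


# Hypothetical responses to four adversarial images.
records = [
    {"response": "This is a photo of a dog.",
     "true_label": "cat", "target_label": "dog"},
    {"response": "I'm sorry, I can't identify this image.",
     "true_label": "cat", "target_label": "dog"},
    {"response": "A cat sitting on a sofa.",
     "true_label": "cat", "target_label": "dog"},
    {"response": "A blurry picture.",
     "true_label": "cat", "target_label": "dog"},
]

print(outcome_rates(records))
```

Reporting rejection as its own outcome keeps a model that refuses from being counted as either robust or misled, which matters when comparing systems with very different refusal behavior.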
Datasets
- OODCV-VQA
- OODCV-Counterfactual
- Sketchy-VQA
- Sketchy-Challenging
- NIPS17 200-image set (misleading attacks)
Benchmarks
- VLLM safety benchmark (OOD + red-teaming)
- Misleading-rate benchmark
- Jailbreak ASR benchmark

