Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
Investing in diverse, human-labeled vision tasks gives larger capability gains and less forgetting than mass synthetic labeling; a small alignment set (~1k GPT-4 examples) can deliver chat-style outputs while avoiding the cost and bias of large synthetic corpora.
Summary TLDR
The authors release VISION-FLAN, a public visual instruction tuning dataset of 187 tasks and 1,664,261 instances built from academic datasets and expert-written instructions. They propose two-stage tuning: first fine-tune a VLM on VISION-FLAN for broad capabilities, then run a light second stage on 1,000 GPT-4-synthesized examples to align responses with human-preferred formats. Results show that diversity of human-labeled tasks raises scores across multiple benchmarks and reduces catastrophic forgetting. Large-scale GPT-4 synthetic data yields little capability gain and can increase hallucination and bias toward 'yes' answers.
Problem Statement
Current visual instruction tuning relies heavily on GPT-4 synthesized data and caption-style pretraining. That yields narrow task coverage, poor generalization on diverse vision tasks (e.g., OCR), annotation bias and hallucination from synthetic labels, and catastrophic forgetting of basic detection tasks.
Main Contribution
VISION-FLAN: a public dataset of 187 human-labeled visual tasks (1,664,261 instances) with expert-written instructions
A two-stage visual instruction tuning recipe: (1) fine-tune on VISION-FLAN, (2) brief GPT-4-based alignment (1k examples)
Empirical claim: task diversity from human labels improves capabilities and reduces catastrophic forgetting
Analysis showing that a small amount of GPT-4 tuning data (≈1,000 examples) is enough to align response style, while large-scale GPT-4 data adds bias and hallucination
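The two-stage recipe can be sketched as an ordered plan in which the second stage starts from the first stage's checkpoint. This is a minimal illustration, not the authors' training code: the `Stage` class, the dataset identifiers, and the `train_fn` callback are hypothetical names introduced here, and the stand-in trainer only records stage order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str                    # label for logging
    dataset: str                 # hypothetical dataset identifier
    max_examples: Optional[int]  # None means "use the whole dataset"


def two_stage_plan(alignment_examples: int = 1000) -> List[Stage]:
    """Stage 1 builds capabilities; stage 2 is a small alignment pass."""
    return [
        Stage("capability", "vision-flan", None),
        Stage("alignment", "gpt4-synthesized", alignment_examples),
    ]


def run_plan(plan: List[Stage], train_fn: Callable, checkpoint):
    """Run stages strictly in order, feeding each checkpoint to the next."""
    for stage in plan:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint


# Stand-in trainer (assumption): records applied stages, trains nothing.
def fake_train(checkpoint, stage):
    return checkpoint + [stage.name]


final = run_plan(two_stage_plan(), fake_train, [])
```

Keeping the stages sequential, rather than concatenating both datasets into one pass, mirrors the distinction the summary draws between two-stage and mixed fine-tuning.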
Key Findings
VISION-FLAN is large and diverse: 187 tasks and 1,664,261 instances.
VISION-FLAN BASE (trained only on human-labeled tasks) achieves top performance on comprehensive benchmarks.
A brief second stage using 1,000 GPT-4-synthesized examples sharply improves human-preference alignment.
Scaling the number of human-labeled tasks improves performance more than scaling instances per task when total instances are fixed.
Large amounts of GPT-4-synthesized data do not increase core capability and raise hallucination and 'yes' bias.
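The tasks-versus-instances comparison holds the total instance budget fixed and varies how many tasks share it. One way to build such subsets is sketched below; the function name, the even per-task split, and the toy corpus are assumptions made for illustration, not the authors' sampling procedure.

```python
import random
from typing import Dict, List, Sequence


def sample_fixed_budget(
    task_to_instances: Dict[str, Sequence],
    num_tasks: int,
    total_budget: int,
    seed: int = 0,
) -> Dict[str, List]:
    """Pick `num_tasks` tasks and split `total_budget` evenly among them."""
    if not 1 <= num_tasks <= len(task_to_instances):
        raise ValueError("num_tasks must be between 1 and the number of tasks")
    rng = random.Random(seed)
    chosen = rng.sample(sorted(task_to_instances), num_tasks)
    per_task = total_budget // num_tasks
    subset = {}
    for task in chosen:
        pool = list(task_to_instances[task])
        # A task smaller than its share contributes everything it has.
        subset[task] = rng.sample(pool, min(per_task, len(pool)))
    return subset


# Toy corpus (assumption): 10 tasks with 100 instances each.
toy = {f"task_{i}": list(range(100)) for i in range(10)}
few_tasks = sample_fixed_budget(toy, num_tasks=2, total_budget=100)
many_tasks = sample_fixed_budget(toy, num_tasks=10, total_budget=100)
```

Both subsets contain the same number of instances, so any difference in downstream scores can be attributed to task count rather than data volume.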
Results
MM-Bench
MME
LLaVA-Bench (human-preference)
Catastrophic Forgetting (CF averaged)
Two-stage vs mixed fine-tuning
Who Should Care
What To Try In 7 Days
Audit your visual training data: count distinct task types and add missing ones (OCR, detection, domain tests).
Fine-tune your VLM for one epoch on a small, diverse set of human-labeled tasks to boost generalization.
Run a 1,000-example GPT-4 alignment pass and compare human-preference metrics to heavy synthetic tuning.
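The data audit in the first step can start as a simple count of task types. The sketch below assumes each training record is a dict with a `task_type` field and that you maintain your own list of required task types; both are hypothetical conventions, not part of VISION-FLAN.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def audit_task_coverage(
    records: Iterable[Dict[str, str]],
    required: List[str],
) -> Tuple[Counter, List[str]]:
    """Count records per task type and list required types with no data."""
    counts = Counter(r["task_type"] for r in records)
    missing = [t for t in required if counts[t] == 0]
    return counts, missing


# Toy records (assumption): the field name and values are illustrative.
records = [
    {"task_type": "captioning"},
    {"task_type": "vqa"},
    {"task_type": "captioning"},
    {"task_type": "ocr"},
]
counts, missing = audit_task_coverage(records, required=["ocr", "detection"])
```

A heavily skewed `counts` or a non-empty `missing` list points to the task types worth adding first.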
Agent Features
Architectures
- LLaVA-Architecture
- Vicuna-13B v1.5
- CLIP-ViT-L-336px
Optimization Features
Training Optimization
- two-stage fine-tuning (human tasks then small GPT-4 alignment)
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- All tasks and instructions are English-only, limiting multilingual use.
- VISION-FLAN focuses on single-image tasks; multi-image or video scenarios are not covered.
- Experiments mainly use the LLaVA-Architecture, so results may vary for other bridging modules.
When Not To Use
- You need non-English or multilingual visual instruction tuning.
- Your application requires multi-image or video reasoning.
- You use a different VLM architecture without similar bridging modules.
Failure Modes
- Large-scale GPT-4 synthetic tuning can increase hallucinations and bias toward 'Yes' answers.
- Mixing large synthetic datasets with human-labeled data in a single stage can worsen alignment and capability compared with two-stage tuning.
- If bridging MLPs and LLMs are not tuned appropriately, capability gains may be lost.
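Bias toward 'yes' answers can be monitored on any yes/no probe set by tracking the share of 'yes' predictions alongside accuracy, precision, recall, and F1, as is commonly done for POPE-style evaluation. The following is a minimal sketch that assumes free-text answers beginning with "yes" or "no"; the function names and the parsing rule are illustrative and not taken from any benchmark's official scorer.

```python
from typing import Dict, Sequence


def is_yes(answer: str) -> bool:
    """Treat any answer that starts with 'yes' (case-insensitive) as yes."""
    return answer.strip().lower().startswith("yes")


def yes_no_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1, and yes ratio for yes/no answers."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    pred = [is_yes(p) for p in predictions]
    gold = [is_yes(g) for g in labels]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))
    tn = sum(not p and not g for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": sum(pred) / len(pred),  # share of 'yes' predictions
    }


# Toy example (assumption): a model that over-answers 'yes'.
metrics = yes_no_metrics(
    ["Yes, there is.", "yes", "No.", "Yes"],
    ["yes", "no", "no", "no"],
)
```

A yes ratio well above the share of 'yes' labels in the probe set, combined with high recall and low precision, is the pattern to watch for after heavy synthetic tuning.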
Core Entities
Models
- LLaVA-Architecture
- VISION-FLAN BASE
- VISION-FLAN CHAT
- Vicuna-13B v1.5
- CLIP-ViT-L-336px
- LLaMA 2 Chat
- BLIP-2
- InstructBLIP
- Shikra
- Qwen-VL
- LLaVA 1.5
Metrics
- MM-Bench score
- MME score
- LLaVA-Bench score (human-preference)
- POPE hallucination
- CF averaged score (catastrophic forgetting)
Datasets
- VISION-FLAN
- LLaVA (GPT-4 synthesized)
- MM-Bench
- MME
- MM-Vet
- LLaVA-Bench
- POPE
- MMMU
- CIFAR-10
- CIFAR-100
- MNIST
- miniImageNet
- TextOCR
Benchmarks
- MM-Bench
- MME
- LLaVA-Bench
- MM-Vet
- POPE
- MMMU
- catastrophic forgetting (CF) suite

