Overview
The paper presents multi-benchmark and ablation evidence for two claims: diverse human-labeled tasks improve capability, and a small GPT-4 alignment stage (~1k examples) is enough to change output style. Results are reported on MM-Bench, MME, LLaVA-Bench, POPE, and catastrophic-forgetting tests.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Investing in diverse, human-labeled vision tasks gives larger capability gains and less forgetting than mass synthetic labeling; a small alignment set (~1k GPT-4 examples) can deliver chat-style outputs while avoiding the cost and bias of large synthetic corpora.
Summary TLDR
The authors release VISION-FLAN, a public visual instruction tuning dataset of 187 tasks and 1,664,261 instances built from academic datasets and expert-written instructions. They propose two-stage tuning: first fine-tune a VLM on VISION-FLAN for broad capabilities, then run a light second stage (1,000 GPT-4-synthesized examples) to align responses to human-preferred formats. Results show that diversity of human-labeled tasks raises multi-benchmark scores and reduces catastrophic forgetting. Large-scale GPT-4 synthetic data gives little capability gain and can add hallucination and a bias toward 'yes' answers.
Problem Statement
Current visual instruction tuning relies heavily on GPT-4 synthesized data and caption-style pretraining. That yields narrow task coverage, poor generalization on diverse vision tasks (e.g., OCR), annotation bias and hallucination from synthetic labels, and catastrophic forgetting of basic detection tasks.
Main Contribution
VISION-FLAN: a public dataset of 187 human-labeled visual tasks (1,664,261 instances) with expert-written instructions
A two-stage visual instruction tuning recipe: (1) fine-tune on VISION-FLAN, (2) brief GPT-4-based alignment (1k examples)
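The two-stage recipe can be pictured as a data schedule: all human-labeled examples feed stage 1, and a small random sample of GPT-4-synthesized examples feeds stage 2. The following is a minimal sketch under that reading only. The function name `build_two_stage_plan`, the example dictionaries, and the uniform random sampling are assumptions for illustration, not the authors' released pipeline, and no actual model training is shown.

```python
import random
from typing import Dict, List, Tuple

Example = Dict[str, str]  # hypothetical record shape, e.g. {"task": ..., "instruction": ...}


def build_two_stage_plan(
    human_labeled: List[Example],
    gpt4_synthetic: List[Example],
    alignment_size: int = 1000,  # the summary cites roughly 1k alignment examples
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Return (stage1_data, stage2_data) for a two-stage tuning run."""
    rng = random.Random(seed)

    # Stage 1: every human-labeled example, shuffled so tasks are interleaved.
    stage1 = list(human_labeled)
    rng.shuffle(stage1)

    # Stage 2: a small sample of synthetic examples used only for alignment.
    # Uniform sampling is an assumption; the source does not say how the 1k are chosen.
    k = min(alignment_size, len(gpt4_synthetic))
    stage2 = rng.sample(gpt4_synthetic, k)

    return stage1, stage2


if __name__ == "__main__":
    human = [{"task": "ocr", "instruction": f"read text {i}"} for i in range(5)]
    synthetic = [{"task": "chat", "instruction": f"describe {i}"} for i in range(3000)]
    s1, s2 = build_two_stage_plan(human, synthetic)
    print(len(s1), len(s2))
```

Keeping the two pools in separate stages, rather than concatenating them, mirrors the summary's point that the synthetic data is there for response format and not for capability.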
Key Findings
VISION-FLAN is large and diverse: 187 tasks and 1,664,261 instances.
VISION-FLAN BASE (trained only on human-labeled tasks) achieves top performance on comprehensive benchmarks, for example 69.8 on MM-Bench versus 66.7 for LLaVA 1.5 (Table 2).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MM-Bench | 69.8 (VISION-FLAN BASE) | LLaVA 1.5 66.7 | +3.1 | Table 2 aggregated | Table 2: VISION-FLAN BASE MM-Bench 69.8 vs LLaVA1.5 66.7 | Table 2 |
| MME | 1537.8 (VISION-FLAN BASE) | LLaVA 1.5 1531.3 | +6.5 | Table 2 aggregated | Table 2: VISION-FLAN BASE MME 1537.8 vs LLaVA1.5 1531.3 | Table 2 |
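The Delta column can be rechecked from the Value and Baseline columns. The sketch below copies the two rows from the table and recomputes the absolute difference. The relative-gain helper is an addition for context (the two benchmarks use very different score scales) and is not a figure reported in the source.

```python
from typing import List, Tuple

# (metric, VISION-FLAN BASE value, LLaVA 1.5 baseline), copied from the table above
ROWS: List[Tuple[str, float, float]] = [
    ("MM-Bench", 69.8, 66.7),
    ("MME", 1537.8, 1531.3),
]


def delta(value: float, baseline: float) -> float:
    """Absolute improvement over the baseline, rounded to one decimal place."""
    return round(value - baseline, 1)


def relative_gain_pct(value: float, baseline: float) -> float:
    """Improvement as a percentage of the baseline (not reported in the source)."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    for metric, value, baseline in ROWS:
        print(f"{metric}: delta={delta(value, baseline):+.1f}, "
              f"relative={relative_gain_pct(value, baseline):.2f}%")
```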
What To Try In 7 Days
Audit your visual training data: count distinct task types and add missing ones (OCR, detection, domain tests).
Fine-tune your VLM for one epoch on a small diverse human-labeled task set to boost generalization.
Run a 1,000-example GPT-4 alignment pass and compare human-preference metrics to heavy synthetic tuning.
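The data audit in the first step can start as a simple count of examples per task type, plus a check against the task types you expect to cover. This is a minimal sketch that assumes each training record is a dictionary with a `task` field. The field name, the `audit_task_coverage` helper, and the required-task list are hypothetical and should be adapted to your own manifest format.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def audit_task_coverage(
    records: Iterable[Dict[str, str]],
    required_tasks: Iterable[str],
) -> Tuple[Counter, List[str]]:
    """Count examples per task type and list required tasks with no examples."""
    counts = Counter(record["task"] for record in records)
    missing = sorted(set(required_tasks) - set(counts))
    return counts, missing


if __name__ == "__main__":
    # Toy manifest; real data would be loaded from your own files.
    manifest = [
        {"task": "captioning"},
        {"task": "captioning"},
        {"task": "vqa"},
        {"task": "ocr"},
    ]
    counts, missing = audit_task_coverage(manifest, ["ocr", "detection", "vqa"])
    print("distinct task types:", len(counts))
    print("per-task counts:", dict(counts))
    print("missing task types:", missing)
```

A heavily skewed count (for example, mostly captioning) is the kind of narrow coverage the Problem Statement describes.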
Risks & Boundaries
Limitations
All tasks and instructions are English-only, limiting multilingual use.
VISION-FLAN focuses on single-image tasks; multi-image or video scenarios are not covered.
When Not To Use
You need non-English or multilingual visual instruction tuning.
Your application requires multi-image or video reasoning.
Failure Modes
Large-scale GPT-4 synthetic tuning can increase hallucinations and bias toward 'Yes' answers.
Mixing large synthetic datasets with human-labeled data in a single stage can worsen alignment and capability compared with two-stage tuning.
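One way to watch for the 'Yes' bias is to measure how often a model answers "yes" on a set of yes/no questions. A ratio well above the share of "yes" in the ground truth suggests bias. The sketch below is a crude check, not the evaluation protocol from the paper: it classifies each answer by its first word only, and the `yes_ratio` helper is hypothetical.

```python
import re
from typing import Iterable


def first_word(answer: str) -> str:
    """Return the first alphabetic word of an answer, lowercased ('' if none)."""
    match = re.match(r"\s*([A-Za-z]+)", answer)
    return match.group(1).lower() if match else ""


def yes_ratio(answers: Iterable[str]) -> float:
    """Fraction of 'yes' among answers whose first word is 'yes' or 'no'.

    Answers starting with anything else are ignored; returns 0.0 if none qualify.
    """
    words = [first_word(a) for a in answers]
    yes_no = [w for w in words if w in ("yes", "no")]
    if not yes_no:
        return 0.0
    return sum(w == "yes" for w in yes_no) / len(yes_no)


if __name__ == "__main__":
    model_answers = ["Yes, there is a dog.", "no", "Yes", "Not sure"]
    print(f"yes ratio: {yes_ratio(model_answers):.2f}")
```

Comparing this ratio before and after adding synthetic data, on the same question set, gives a quick signal of whether the bias is growing.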

