A 187-task human-labeled dataset (1.66M instances) + two-stage tuning that needs only 1k GPT-4 examples to align VLM outputs

February 18, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper gives multi-benchmark and ablation evidence that diverse human-labeled tasks improve capability and that a small GPT-4 alignment stage (~1k examples) is sufficient to change output style; results are shown across MM-Bench, MME, LLaVA-Bench, POPE and catastrophic-forgetting tests.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang

Links

Abstract / PDF

Why It Matters For Business

Investing in diverse, human-labeled vision tasks gives larger capability gains and less forgetting than mass synthetic labeling; a small alignment set (~1k GPT-4 examples) can deliver chat-style outputs while avoiding the cost and bias of large synthetic corpora.

Who Should Care

Summary TLDR

The authors release VISION-FLAN, a public visual instruction tuning dataset of 187 tasks and 1,664,261 instances built from academic datasets and expert-written instructions. They propose two-stage tuning: first fine-tune a VLM on VISION-FLAN for broad capabilities, then run a light second stage (1,000 GPT-4-synthesized examples) to align responses to human-preferred formats. Results show that a diverse set of human-labeled tasks raises multi-benchmark scores and reduces catastrophic forgetting, whereas large-scale GPT-4 synthetic data gives little capability gain and can add hallucination and a bias toward 'yes' answers.

Problem Statement

Current visual instruction tuning relies heavily on GPT-4 synthesized data and caption-style pretraining. That yields narrow task coverage, poor generalization on diverse vision tasks (e.g., OCR), annotation bias and hallucination from synthetic labels, and catastrophic forgetting of basic detection tasks.

Main Contribution

VISION-FLAN: a public dataset of 187 human-labeled visual tasks (1,664,261 instances) with expert-written instructions

A two-stage visual instruction tuning recipe: (1) fine-tune on VISION-FLAN, (2) brief GPT-4-based alignment (1k examples)
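The two-stage recipe can be read as a data schedule. The following is a minimal Python sketch, not the authors' training code: `Stage`, `build_two_stage_plan`, `run_plan`, and the `train_fn` callback are hypothetical names, and the only details taken from the summary are the stage order and the roughly 1,000-example size of the alignment set.

```python
import random
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    examples: list


def build_two_stage_plan(human_labeled, gpt4_synth, align_size=1000, seed=0):
    """Return the two stages in training order.

    Stage 1 uses the whole human-labeled pool; stage 2 uses a small
    random sample of GPT-4-synthesized examples (assumed sampling rule).
    """
    rng = random.Random(seed)
    # Cap the sample size so a small synthetic pool does not raise ValueError.
    k = min(align_size, len(gpt4_synth))
    align_subset = rng.sample(list(gpt4_synth), k)
    return [
        Stage("stage1_human_labeled", list(human_labeled)),
        Stage("stage2_gpt4_alignment", align_subset),
    ]


def run_plan(plan, train_fn, model):
    """Apply train_fn stage by stage; train_fn is supplied by the caller."""
    for stage in plan:
        model = train_fn(model, stage.examples)
    return model
```

Keeping the alignment subset as a separate, explicitly sized stage makes it easy to compare against a single-stage run that mixes both pools.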

Key Findings

VISION-FLAN is large and diverse: 187 tasks and 1,664,261 instances.

Numbers: 187 tasks; 1,664,261 instances

Practical Use: Use diverse, task-level human labels to cover perception, OCR, reasoning, and domain tasks rather than relying solely on caption-style synthetic data.

Evidence Ref: Section 2, Table 1

VISION-FLAN BASE (trained only on human-labeled tasks) scores higher than LLaVA 1.5 on the comprehensive benchmarks MM-Bench and MME.

Numbers: MM-Bench 69.8; MME 1537.8 (Table 2)

Practical Use: Prioritize time and budget for collecting diverse human-labeled task data to raise model capability across many vision benchmarks.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| MM-Bench | 69.8 (VISION-FLAN BASE) | LLaVA 1.5: 66.7 | +3.1 | Table 2 aggregated | Table 2: VISION-FLAN BASE MM-Bench 69.8 vs LLaVA 1.5 66.7 | Table 2 |
| MME | 1537.8 (VISION-FLAN BASE) | LLaVA 1.5: 1531.3 | +6.5 | Table 2 aggregated | Table 2: VISION-FLAN BASE MME 1537.8 vs LLaVA 1.5 1531.3 | Table 2 |
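The deltas in the results above are simple differences against the LLaVA 1.5 baseline. A small Python check, using only the four numbers reported here (the `results` layout and function names are illustrative), also shows the relative gain, which helps when comparing metrics on very different scales:

```python
# (metric, VISION-FLAN BASE score, LLaVA 1.5 baseline), as reported above.
results = [
    ("MM-Bench", 69.8, 66.7),
    ("MME", 1537.8, 1531.3),
]


def absolute_delta(value, baseline):
    """Score difference, rounded to one decimal to hide float noise."""
    return round(value - baseline, 1)


def relative_gain_pct(value, baseline):
    """Improvement as a percentage of the baseline score."""
    return 100.0 * (value - baseline) / baseline


for metric, value, baseline in results:
    print(
        f"{metric}: {absolute_delta(value, baseline):+.1f} absolute, "
        f"{relative_gain_pct(value, baseline):+.2f}% relative"
    )
```

MME scores are on a much larger scale than MM-Bench scores, so the raw deltas are not directly comparable; the relative figure is the fairer view.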

What To Try In 7 Days

Audit your visual training data: count distinct task types and add missing ones (OCR, detection, domain tests).

Fine-tune your VLM for one epoch on a small diverse human-labeled task set to boost generalization.

Run a 1,000-example GPT-4 alignment pass and compare human-preference metrics to heavy synthetic tuning.
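The audit step in the list above can start as a simple count. This is a rough Python sketch under stated assumptions: each training record is a dict with a `task_type` field, and the `REQUIRED_TASK_TYPES` set is a hypothetical checklist rather than anything defined by the paper.

```python
from collections import Counter

# Hypothetical checklist of task types you want covered; adjust to your domain.
REQUIRED_TASK_TYPES = {"ocr", "detection", "captioning", "vqa"}


def audit_task_coverage(records, required=REQUIRED_TASK_TYPES, task_key="task_type"):
    """Count examples per task type and list required types with no examples."""
    counts = Counter(record[task_key] for record in records)
    missing = sorted(set(required) - set(counts))
    return counts, missing


if __name__ == "__main__":
    sample = [
        {"task_type": "captioning"},
        {"task_type": "captioning"},
        {"task_type": "vqa"},
    ]
    counts, missing = audit_task_coverage(sample)
    print(f"{len(counts)} distinct task types: {dict(counts)}")
    print(f"missing: {missing}")
```

A heavily skewed count (for example, mostly captioning) is the signal to add human-labeled tasks before spending on more synthetic data.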

Agent Features

Architectures
LLaVA-Architecture, Vicuna-13B v1.5, CLIP-ViT-L-336px

Optimization Features

Training Optimization
two-stage fine-tuning (human tasks then small GPT-4 alignment)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

All tasks and instructions are English-only, limiting multilingual use.

VISION-FLAN focuses on single-image tasks; multi-image or video scenarios are not covered.

When Not To Use

You need non-English or multilingual visual instruction tuning.

Your application requires multi-image or video reasoning.

Failure Modes

Large-scale GPT-4 synthetic tuning can increase hallucinations and bias toward 'Yes' answers.

Mixing large synthetic datasets with human-labeled data in a single stage can worsen both alignment and capability compared with two-stage tuning.
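One way to watch for the 'Yes' bias described above is to compare how often the model answers yes with how often the reference labels do on binary questions. The sketch below is an illustrative Python check, not the official POPE scoring; the function names and the first-word normalization rule are assumptions.

```python
def normalize_answer(answer: str) -> str:
    """Map a free-form answer to 'yes', 'no', or 'other' by its first word."""
    words = answer.strip().lower().split()
    if not words:
        return "other"
    first = words[0].strip(".,!?:;")
    return first if first in ("yes", "no") else "other"


def yes_ratio(answers) -> float:
    """Fraction of 'yes' among answers that are clearly yes or no."""
    labels = [normalize_answer(a) for a in answers]
    binary = [label for label in labels if label != "other"]
    if not binary:
        return 0.0
    return sum(label == "yes" for label in binary) / len(binary)


def yes_bias_gap(predictions, references) -> float:
    """Positive values mean the model says 'yes' more often than the labels do."""
    return yes_ratio(predictions) - yes_ratio(references)
```

Tracking this gap before and after a synthetic-data tuning run gives a quick signal of whether the run pushed the model toward agreeing by default.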

Core Entities

Models

LLaVA-Architecture, VISION-FLAN BASE, VISION-FLAN CHAT, Vicuna-13B v1.5, CLIP-ViT-L-336px, LLaMA 2 Chat, BLIP-2, InstructBLIP, Shikra, Qwen-VL, LLaVA 1.5

Metrics

MM-Bench score, MME score, LLaVA-Bench score (human-preference), POPE hallucination, CF averaged score (catastrophic forgetting)

Datasets

VISION-FLAN, LLaVA (GPT-4 synthesized), MM-Bench, MME, MM-Vet, LLaVA-Bench, POPE, MMMU, CIFAR-10, CIFAR-100, MNIST, miniImageNet, TextOCR

Benchmarks

MM-Bench, MME, LLaVA-Bench, MM-Vet, POPE, MMMU, catastrophic forgetting (CF) suite