A 187-task human-labeled dataset (1.66M instances) + two-stage tuning that needs only 1k GPT-4 examples to align VLM outputs

February 18, 2024

8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang

Why It Matters For Business

Investing in diverse, human-labeled vision tasks gives larger capability gains and less forgetting than mass synthetic labeling; a small alignment set (~1k GPT-4 examples) can deliver chat-style outputs while avoiding the cost and bias of large synthetic corpora.

Summary TLDR

The authors release VISION-FLAN, a public visual instruction tuning dataset of 187 tasks and 1,664,261 instances built from academic datasets and expert-written instructions. They propose two-stage tuning: first fine-tune a VLM on VISION-FLAN for broad capabilities, then run a light second stage on 1,000 GPT-4-synthesized examples to align responses with human-preferred formats. Results show that a diverse set of human-labeled tasks raises multi-benchmark scores and reduces catastrophic forgetting, whereas large-scale GPT-4 synthetic data gives little capability gain and can add hallucination and 'yes' bias.

Problem Statement

Current visual instruction tuning relies heavily on GPT-4 synthesized data and caption-style pretraining. That yields narrow task coverage, poor generalization on diverse vision tasks (e.g., OCR), annotation bias and hallucination from synthetic labels, and catastrophic forgetting of basic detection tasks.

Main Contribution

VISION-FLAN: a public dataset of 187 human-labeled visual tasks (1,664,261 instances) with expert-written instructions

A two-stage visual instruction tuning recipe: (1) fine-tune on VISION-FLAN, (2) brief GPT-4-based alignment (1k examples)

Empirical claim: task diversity from human labels improves capabilities and reduces catastrophic forgetting

Analysis showing small GPT-4 tuning (≈1,000 examples) aligns style, while large GPT-4 data adds bias and hallucination
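The difference between the two-stage recipe and single-stage mixing can be sketched as plain control flow. This is a minimal illustration, not the authors' training code: `toy_train`, the dataset dictionaries, and their names are hypothetical stand-ins, and a real run would replace `toy_train` with an actual fine-tuning call.

```python
def toy_train(history, dataset):
    """Hypothetical stand-in for one fine-tuning run.

    It only records which dataset the run saw, so the order of stages is visible.
    """
    return history + [dataset["name"]]


def two_stage(history, train_fn, human_data, gpt4_data):
    """Stage 1 on human-labeled tasks, then stage 2 on the small GPT-4 set."""
    history = train_fn(history, human_data)
    return train_fn(history, gpt4_data)


def mixed(history, train_fn, human_data, gpt4_data):
    """Single run on the concatenation of both datasets."""
    merged = {
        "name": human_data["name"] + "+" + gpt4_data["name"],
        "examples": human_data["examples"] + gpt4_data["examples"],
    }
    return train_fn(history, merged)


# Toy placeholders; the real sets are VISION-FLAN and ~1k GPT-4 examples.
human = {"name": "vision_flan", "examples": ["h1", "h2", "h3"]}
gpt4 = {"name": "gpt4_1k", "examples": ["g1"]}

two_stage_runs = two_stage([], toy_train, human, gpt4)
mixed_runs = mixed([], toy_train, human, gpt4)
print(two_stage_runs)  # ['vision_flan', 'gpt4_1k']
print(mixed_runs)      # ['vision_flan+gpt4_1k']
```

The only structural difference is whether the alignment data gets its own final pass or is diluted inside one combined run.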

Key Findings

VISION-FLAN is large and diverse: 187 tasks and 1,664,261 instances.

Numbers: 187 tasks; 1,664,261 instances

VISION-FLAN BASE (trained only on human-labeled tasks) scores above LLaVA 1.5 on the comprehensive benchmarks MM-Bench and MME.

Numbers: MM-Bench 69.8; MME 1537.8 (Table 2)

A brief second stage using 1,000 GPT-4-synthesized examples sharply improves human-preference alignment.

Numbers: LLaVA-Bench 38.5 -> 78.3 after 1k GPT-4 examples

Scaling the number of human-labeled tasks improves performance more than scaling instances per task when total instances are fixed.

Numbers: 100k total: 10 tasks (10k each) MME 723.9 vs 187 tasks (500 each) MME 1314.3 (Table 3)

Large amounts of GPT-4-synthesized data do not increase core capability and raise hallucination and 'yes' bias.

Numbers: Increasing GPT-4 instances shows no MME/MM-Bench gain and raises 'Yes' ratio and hallucination (Figures 5–7)
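The fixed-budget comparison (few tasks with many instances each versus many tasks with few instances each) can be reproduced on your own data with a small sampler. The sketch below is an assumption-laden illustration: the `(task_name, example)` pair layout, the even per-task split, and the function name `sample_fixed_budget` are all hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def sample_fixed_budget(instances, num_tasks, total_budget, seed=0):
    """Pick `num_tasks` tasks and split `total_budget` evenly across them.

    `instances` is a list of (task_name, example) pairs (assumed layout).
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task, example in instances:
        by_task[task].append(example)
    if num_tasks > len(by_task):
        raise ValueError("not enough distinct tasks in the pool")

    chosen = rng.sample(sorted(by_task), num_tasks)
    per_task = total_budget // num_tasks
    subset = []
    for task in chosen:
        pool = by_task[task]
        k = min(per_task, len(pool))  # never request more than the task holds
        subset.extend((task, ex) for ex in rng.sample(pool, k))
    return subset


# Toy pool: 20 tasks with 50 examples each.
pool = [(f"task_{t}", f"ex_{t}_{i}") for t in range(20) for i in range(50)]
narrow = sample_fixed_budget(pool, num_tasks=2, total_budget=100)
broad = sample_fixed_budget(pool, num_tasks=20, total_budget=100)
print(len(narrow), len({t for t, _ in narrow}))  # 100 2
print(len(broad), len({t for t, _ in broad}))    # 100 20
```

Training one model on `narrow` and one on `broad` keeps the instance count equal, so any score gap can be attributed to task diversity rather than data volume.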

Results

MM-Bench

Value: 69.8 (VISION-FLAN BASE)

Baseline: LLaVA 1.5 66.7

MME

Value: 1537.8 (VISION-FLAN BASE)

Baseline: LLaVA 1.5 1531.3

LLaVA-Bench (human-preference)

Value: 38.5 (VISION-FLAN BASE) -> 78.3 (VISION-FLAN CHAT after 1k GPT-4)

Baseline: VISION-FLAN BASE 38.5

Catastrophic Forgetting (CF averaged)

Value: 87.2 (VISION-FLAN BASE)

Baseline: LLaVA 1.5 73.3

Two-stage vs mixed fine-tuning

Value: Two-stage (1k GPT-4): MME 1490.6; LLaVA-Bench 78.3; MM-Vet 38.0

Baseline: Mixed data (1k): MME 1364; LLaVA-Bench 52.7; MM-Vet 36.6
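To make the size of each gap explicit, the reported values and baselines above can be turned into absolute differences. The snippet only restates numbers already listed in this section; the dictionary layout and the helper name `deltas` are illustrative choices.

```python
def deltas(pairs):
    """Return value minus baseline for each metric, rounded to one decimal."""
    return {name: round(value - baseline, 1) for name, (value, baseline) in pairs.items()}


# (value, baseline) pairs copied from the results listed above.
base_vs_llava15 = deltas({
    "MM-Bench": (69.8, 66.7),
    "MME": (1537.8, 1531.3),
    "CF averaged": (87.2, 73.3),
})
two_stage_vs_mixed = deltas({
    "MME": (1490.6, 1364.0),
    "LLaVA-Bench": (78.3, 52.7),
    "MM-Vet": (38.0, 36.6),
})
chat_vs_base_llava_bench = round(78.3 - 38.5, 1)

print(base_vs_llava15)           # {'MM-Bench': 3.1, 'MME': 6.5, 'CF averaged': 13.9}
print(two_stage_vs_mixed)        # {'MME': 126.6, 'LLaVA-Bench': 25.6, 'MM-Vet': 1.4}
print(chat_vs_base_llava_bench)  # 39.8
```

Note that the metrics use different scales, so the differences are comparable only within a metric, not across metrics.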

Who Should Care

What To Try In 7 Days

Audit your visual training data: count distinct task types and add missing ones (OCR, detection, domain tests).

Fine-tune your VLM for one epoch on a small diverse human-labeled task set to boost generalization.

Run a 1,000-example GPT-4 alignment pass and compare human-preference metrics to heavy synthetic tuning.
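The data audit in the first step can start as a simple count of examples per task type. This is a minimal sketch that assumes each training record is a dictionary with a `"task"` field; that field name, the `required` list, and the helper `audit_task_coverage` are hypothetical and should be adapted to your own schema.

```python
from collections import Counter


def audit_task_coverage(records, required_tasks):
    """Count examples per task type and list required task types with no examples."""
    counts = Counter(record["task"] for record in records)
    missing = sorted(set(required_tasks) - set(counts))
    return counts, missing


# Toy records; real data would come from your training manifest.
records = [
    {"task": "captioning", "image": "a.jpg"},
    {"task": "captioning", "image": "b.jpg"},
    {"task": "vqa", "image": "c.jpg"},
]
required = ["captioning", "vqa", "ocr", "detection"]

counts, missing = audit_task_coverage(records, required)
print(counts.most_common())  # [('captioning', 2), ('vqa', 1)]
print(missing)               # ['detection', 'ocr']
```

A heavily skewed `counts` or a non-empty `missing` list points to the task types worth adding before any further tuning.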

Agent Features

Architectures

  • LLaVA-Architecture
  • Vicuna-13B v1.5
  • CLIP-ViT-L-336px

Optimization Features

Training Optimization

  • two-stage fine-tuning (human tasks then small GPT-4 alignment)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • All tasks and instructions are English-only, limiting multilingual use.
  • VISION-FLAN focuses on single-image tasks; multi-image or video scenarios are not covered.
  • Experiments mainly use the LLaVA-Architecture, so results may vary for other bridging modules.

When Not To Use

  • You need non-English or multilingual visual instruction tuning.
  • Your application requires multi-image or video reasoning.
  • You use a different VLM architecture without similar bridging modules.

Failure Modes

  • Large-scale GPT-4 synthetic tuning can increase hallucinations and bias toward 'Yes' answers.
  • Mixing large synthetic datasets with human-labeled data in a single stage can worsen alignment and capability compared with two-stage tuning.
  • If bridging MLPs and LLMs are not tuned appropriately, capability gains may be lost.
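The 'Yes' bias named in the first failure mode can be monitored with a simple ratio over a model's answers to yes/no questions. The sketch below is one plausible way to compute it, not the paper's evaluation script: the prefix-matching rule and the function name `yes_ratio` are assumptions.

```python
def yes_ratio(answers):
    """Fraction of answers that start with 'yes' (case-insensitive).

    Returns 0.0 for an empty list to avoid division by zero.
    """
    normalized = [answer.strip().lower() for answer in answers]
    if not normalized:
        return 0.0
    yes_count = sum(1 for answer in normalized if answer.startswith("yes"))
    return yes_count / len(normalized)


# Toy model outputs for yes/no questions.
outputs = ["Yes", "no", "Yes, there is a dog.", "No."]
ratio = yes_ratio(outputs)
print(ratio)  # 0.5
```

On a question set with balanced ground truth, a ratio well above 0.5 after synthetic tuning would be a sign of the bias described above.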

Core Entities

Models

  • LLaVA-Architecture
  • VISION-FLAN BASE
  • VISION-FLAN CHAT
  • Vicuna-13B v1.5
  • CLIP-ViT-L-336px
  • LLaMA 2 Chat
  • BLIP-2
  • InstructBLIP
  • Shikra
  • Qwen-VL
  • LLaVA 1.5

Metrics

  • MM-Bench score
  • MME score
  • LLaVA-Bench score (human-preference)
  • POPE hallucination
  • CF averaged score (catastrophic forgetting)

Datasets

  • VISION-FLAN
  • LLaVA (GPT-4 synthesized)
  • MM-Bench
  • MME
  • MM-Vet
  • LLaVA-Bench
  • POPE
  • MMMU
  • CIFAR-10
  • CIFAR-100
  • MNIST
  • miniImageNet
  • TextOCR

Benchmarks

  • MM-Bench
  • MME
  • LLaVA-Bench
  • MM-Vet
  • POPE
  • MMMU
  • catastrophic forgetting (CF) suite