A 187-task human-labeled dataset (1.66M instances) + two-stage tuning that needs only 1k GPT-4 examples to align VLM outputs

Overview

Decision Snapshot: Needs Validation

The paper gives multi-benchmark and ablation evidence that diverse human-labeled tasks improve capability and that a small GPT-4 alignment stage (~1k examples) is sufficient to change output style; results are shown across MM-Bench, MME, LLaVA-Bench, POPE and catastrophic-forgetting tests.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang

Why It Matters For Business

Investing in diverse, human-labeled vision tasks gives larger capability gains and less forgetting than mass synthetic labeling; a small alignment set (~1k GPT-4 examples) can deliver chat-style outputs while avoiding the cost and bias of large synthetic corpora.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors release VISION-FLAN, a public visual instruction tuning dataset of 187 tasks and 1,664,261 instances built from academic datasets and expert-written instructions. They propose two-stage tuning: first fine-tune a VLM on VISION-FLAN for broad capabilities, then run a light second stage (1,000 GPT-4-synthesized examples) to align responses to human-preferred formats. Results show that the diversity of human-labeled tasks raises multi-benchmark scores and reduces catastrophic forgetting, whereas large-scale GPT-4 synthetic data gives little capability gain and can add hallucination and 'yes' bias.

Problem Statement

Current visual instruction tuning relies heavily on GPT-4 synthesized data and caption-style pretraining. That yields narrow task coverage, poor generalization on diverse vision tasks (e.g., OCR), annotation bias and hallucination from synthetic labels, and catastrophic forgetting of basic detection tasks.

Main Contribution

VISION-FLAN: a public dataset of 187 human-labeled visual tasks (1,664,261 instances) with expert-written instructions

A two-stage visual instruction tuning recipe: (1) fine-tune on VISION-FLAN, (2) brief GPT-4-based alignment (1k examples)
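The two-stage recipe can be outlined as a small data-scheduling sketch. This is an illustration under stated assumptions, not the authors' code: `Stage`, `sample_alignment_set`, and `build_schedule` are hypothetical names, the seed and the placeholder pool are arbitrary, and the actual fine-tuning call is omitted because this summary does not describe the training stack.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    source: str        # where the examples come from
    num_examples: int  # how many examples are available to the stage


def sample_alignment_set(pool, k=1000, seed=0):
    """Draw k examples from a pool of GPT-4-synthesized examples (stage 2)."""
    if len(pool) < k:
        raise ValueError(f"pool has {len(pool)} examples, need at least {k}")
    return random.Random(seed).sample(pool, k)


def build_schedule(num_human_labeled, alignment_set):
    """Order matters: capability tuning first, then the short alignment pass."""
    return [
        Stage("capability", "human-labeled tasks", num_human_labeled),
        Stage("alignment", "GPT-4 synthesized", len(alignment_set)),
    ]


# Placeholder pool standing in for a larger GPT-4-synthesized corpus (assumption).
synthetic_pool = [{"id": i} for i in range(5000)]
alignment_set = sample_alignment_set(synthetic_pool)

# 1,664,261 is the dataset size quoted in this summary; the subset actually
# used for training is not stated here.
schedule = build_schedule(num_human_labeled=1_664_261, alignment_set=alignment_set)
for stage in schedule:
    print(f"{stage.name}: {stage.num_examples:,} examples from {stage.source}")
```

Keeping the alignment set as a separate, fixed-size sample makes it easy to rerun stage 2 alone when only the output style needs to change.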

Key Findings

VISION-FLAN is large and diverse: 187 tasks and 1,664,261 instances.

Numbers: 187 tasks; 1,664,261 instances

Practical Use: Use diverse, task-level human labels to cover perception, OCR, reasoning, and domain tasks rather than relying solely on caption-style synthetic data.

Evidence Ref: Section 2, Table 1

VISION-FLAN BASE (trained only on human-labeled tasks) outperforms LLaVA 1.5 on the comprehensive benchmarks MM-Bench and MME.

Numbers: MM-Bench 69.8; MME 1537.8 (Table 2)

Practical Use: Prioritize time and budget for collecting diverse human-labeled task data to raise model capability across many vision benchmarks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MM-Bench	69.8 (VISION-FLAN BASE)	LLaVA 1.5 66.7	+3.1	Table 2 aggregated	Table 2: VISION-FLAN BASE MM-Bench 69.8 vs LLaVA 1.5 66.7	Table 2
MME	1537.8 (VISION-FLAN BASE)	LLaVA 1.5 1531.3	+6.5	Table 2 aggregated	Table 2: VISION-FLAN BASE MME 1537.8 vs LLaVA 1.5 1531.3	Table 2
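The Delta column can be rechecked, and put on a comparable scale, with a few lines of arithmetic. The sketch below uses only the scores from the table; `deltas` and the dictionary layout are illustrative choices, not part of the paper.

```python
# Scores copied from the Results table above.
results = {
    "MM-Bench": {"model": 69.8, "baseline": 66.7},
    "MME": {"model": 1537.8, "baseline": 1531.3},
}


def deltas(scores):
    """Return (absolute gain, relative gain in percent) over the baseline."""
    absolute = round(scores["model"] - scores["baseline"], 1)
    relative_pct = round(100 * absolute / scores["baseline"], 2)
    return absolute, relative_pct


for metric, scores in results.items():
    absolute, relative_pct = deltas(scores)
    print(f"{metric}: +{absolute} absolute, +{relative_pct}% relative")
```

Because the two benchmarks use very different score ranges, the relative figure is the fairer comparison: the MM-Bench gain is roughly 4.6% over the baseline, while the MME gain is roughly 0.4%.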

What To Try In 7 Days

Audit your visual training data: count distinct task types and add missing ones (OCR, detection, domain tests).

Fine-tune your VLM for one epoch on a small diverse human-labeled task set to boost generalization.

Run a 1,000-example GPT-4 alignment pass and compare human-preference metrics to heavy synthetic tuning.
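The data audit in the first step can start as a simple count of task types. This is a minimal sketch, assuming each training example carries a `task_type` field; that field name, the `REQUIRED_TASK_TYPES` checklist, and the sample records are all hypothetical and should be replaced with your own schema.

```python
from collections import Counter

# Hypothetical checklist of task types you want covered.
REQUIRED_TASK_TYPES = {"ocr", "detection", "captioning", "vqa"}


def audit_task_coverage(examples, required=REQUIRED_TASK_TYPES):
    """Count examples per task type and list required types with no examples."""
    counts = Counter(example["task_type"] for example in examples)
    missing = sorted(set(required) - set(counts))
    return counts, missing


# Toy records standing in for a real training manifest.
examples = [
    {"task_type": "captioning"},
    {"task_type": "captioning"},
    {"task_type": "vqa"},
]
counts, missing = audit_task_coverage(examples)
print("examples per task type:", dict(counts))
print("missing task types:", missing)
```

A heavily skewed count (for example, mostly captioning) is the kind of narrow coverage the paper argues against.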

Agent Features

Architectures

LLaVA-Architecture, Vicuna-13B v1.5, CLIP-ViT-L-336px

Optimization Features

Training Optimization

two-stage fine-tuning (human tasks then small GPT-4 alignment)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

All tasks and instructions are English-only, limiting multilingual use.

VISION-FLAN focuses on single-image tasks; multi-image or video scenarios are not covered.

When Not To Use

You need non-English or multilingual visual instruction tuning.

Your application requires multi-image or video reasoning.

Failure Modes

Large-scale GPT-4 synthetic tuning can increase hallucinations and bias toward 'Yes' answers.

Mixing large synthetic datasets with human-labeled data in a single stage can worsen alignment and capability compared with two-stage tuning.
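One way to watch for the 'Yes' bias is to track the share of "yes" answers on a set of yes/no questions. The sketch below is an assumption-laden illustration rather than the paper's evaluation code: `yes_ratio` is a hypothetical helper, the normalization rules are deliberately simple, and the sample answers are made up.

```python
def yes_ratio(answers):
    """Fraction of yes/no answers that are 'yes', ignoring case and trailing punctuation."""
    normalized = [answer.strip().lower().rstrip(".!") for answer in answers]
    decided = [answer for answer in normalized if answer in {"yes", "no"}]
    if not decided:
        raise ValueError("no yes/no answers found")
    return sum(answer == "yes" for answer in decided) / len(decided)


# Made-up model outputs for a yes/no question set.
model_answers = ["Yes", "yes.", "No", "Yes", "yes", "no"]
ratio = yes_ratio(model_answers)
print(f"yes ratio: {ratio:.2f}")
```

If the question set is balanced between "yes" and "no" ground truths, a ratio well above 0.5 suggests the model is leaning toward "yes". Comparing the ratio before and after a tuning stage shows whether that stage introduced the bias.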

Core Entities

Models

LLaVA-Architecture, VISION-FLAN BASE, VISION-FLAN CHAT, Vicuna-13B v1.5, CLIP-ViT-L-336px, LLaMA 2 Chat, BLIP-2, InstructBLIP, Shikra, Qwen-VL, LLaVA 1.5

Metrics

MM-Bench score, MME score, LLaVA-Bench score (human-preference), POPE hallucination, CF averaged score (catastrophic forgetting)

Datasets

VISION-FLAN, LLaVA (GPT-4 synthesized), MM-Bench, MME, MM-Vet, LLaVA-Bench, POPE, MMMU, CIFAR-10, CIFAR-100, MNIST, miniImageNet, TextOCR

Benchmarks

MM-Bench, MME, LLaVA-Bench, MM-Vet, POPE, MMMU, catastrophic forgetting (CF) suite
