Use vision-language models to auto-generate and iteratively correct multimodal instruction data

August 24, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method shows practical, reproducible gains when used to augment instruction tuning with generated multimodal QA pairs; results are supported by multiple benchmarks, but hallucination and domain bias still require monitoring.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

VIGC can cheaply scale multimodal instruction data and improve model performance on perception and knowledge VQA tasks, reducing the need for costly human annotation while trimming hallucinations through an automated correction loop.

Summary TLDR

VIGC is a two-part pipeline that uses existing vision-language models to generate large, diverse image-question-answer instruction datasets (VIG) and then reduces hallucinations by iteratively reconditioning visual features on the question and partial answers (VIC). The authors release ~36.8K COCO-based instances and ~1.8M Objects365 instances. Fine-tuning multimodal models (LLaVA, MiniGPT-4+, InstructBLIP) on VIGC data gives consistent gains across LLaVA-Eval, MMBench, OKVQA and A-OKVQA, and the authors report a large drop in detected hallucinations after VIC correction. The pipeline is practical to run (10 hours on 8 A100s to train the Q-Former and projection layer), and the code and datasets are available.

Problem Statement

High-quality vision-language instruction data is scarce. Language-only generators (e.g., GPT-4) need manual pre-annotations and lose image detail. Existing multimodal models can generate data but often hallucinate or produce low-quality answers. The paper asks: can we automatically produce diverse, high-quality visual instruction data from multimodal models and reduce hallucinations so the data is useful for instruction tuning?

Main Contribution

Introduce VIGC, a two-stage self-instruct pipeline: Visual Instruction Generation (VIG) synthesizes image-question-answer pairs, and Visual Instruction Correction (VIC) iteratively fixes hallucinations using an Iterative Q-Former (IQF) update.

Release generated datasets: 36,781 COCO-based VIGC-LLaVACOCO and about 1.8M VIGC-LLaVAObjects365 instances for multimodal instruction tuning.
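The generate-then-correct flow described above can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_qa` and `next_segment` are hypothetical callables standing in for the vision-language model, and extending the answer one segment at a time with a fixed step cap is an illustrative choice, not a detail taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical model interfaces (assumptions, not the released VIGC API):
#   generate_qa(image)                            -> (question, draft_answer)  # VIG stage
#   next_segment(image, question, partial_answer) -> next piece of the answer,
#                                                    or "" when the answer is complete
GenerateQA = Callable[[str], Tuple[str, str]]
NextSegment = Callable[[str, str, str], str]


def vigc_example(
    image: str,
    generate_qa: GenerateQA,
    next_segment: NextSegment,
    max_steps: int = 8,
) -> Tuple[str, str]:
    """Generate one QA pair (VIG), then rebuild the answer step by step (VIC)."""
    question, draft = generate_qa(image)
    segments: List[str] = []
    for _ in range(max_steps):
        partial = " ".join(segments)
        # Each step re-conditions the model on the question plus the answer so far.
        segment = next_segment(image, question, partial).strip()
        if not segment:
            break
        segments.append(segment)
    # Fall back to the uncorrected draft if the correction loop produced nothing.
    answer = " ".join(segments) if segments else draft
    return question, answer
```

In a real run the two callables would wrap model inference; keeping them as parameters makes the loop easy to exercise with stubs before any GPU time is spent.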

Key Findings

Fine-tuning with VIGC COCO data improved LLaVA-7B overall score.

Numbers: Overall 81.0 -> 85.8 (↑4.8)

Practical Use: Add VIGC COCO-style synthetic QA pairs to instruction data; for LLaVA-7B this raised the overall LLaVA-Eval score by 4.8 points (the score itself is measured relative to GPT-4).

Evidence Ref: Table 3 (LLaVA-Eval)

VIC correction sharply reduced detected hallucinations in generated descriptions.

Numbers: Hallucination count 66% -> 10% (on 100-image test)

Practical Use: Apply VIC iterative correction to cut hallucinations in generated data; this yields cleaner training data and better fine-tuned behavior.

Evidence Ref: Table 9 / Appendix B

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LLaVA-Eval overall (LLaVA-7B) | 85.8 | 81.0 | ↑4.8 | LLaVA-Eval (all) | Table 3: LLaVA-7B w/ coco overall 85.8 vs baseline 81.0 | Table 3
Hallucination rate in generated descriptions | 10% | 66% | ↓56pp | 100-image test set (VIG vs VIC) | Table 9: hallucination count reduced from 66% (VIG) to 10% (VIC) | Table 9 / Appendix B
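The deltas in the table are absolute differences (score points and percentage points), which are easy to confuse with relative change. The short sketch below recomputes both from the reported numbers; the helper names are illustrative and not taken from the paper or its code.

```python
def absolute_delta(new: float, baseline: float) -> float:
    """Difference in the metric's own units (points or percentage points)."""
    return new - baseline


def relative_change(new: float, baseline: float) -> float:
    """Change expressed as a fraction of the baseline value."""
    return (new - baseline) / baseline


# Values reported in the table above.
llava_abs = absolute_delta(85.8, 81.0)
llava_rel = relative_change(85.8, 81.0)
halluc_abs = absolute_delta(10.0, 66.0)
halluc_rel = relative_change(10.0, 66.0)

print(f"LLaVA-Eval overall: {llava_abs:+.1f} points ({llava_rel:+.1%} relative)")
print(f"Hallucination rate: {halluc_abs:+.0f} pp ({halluc_rel:+.1%} relative)")
```

Read this way, the 4.8-point LLaVA-Eval gain is about a 5.9% relative improvement, and the 56-percentage-point drop in hallucinations is roughly an 85% relative reduction.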

What To Try In 7 Days

Run VIGC on a small, domain-specific image subset to generate ~10k synthetic QA pairs and fine-tune a 7B multimodal model.

Compare model answers before and after applying VIC to measure hallucination reduction on a 100-image holdout.

Evaluate the tuned model on one domain benchmark (e.g., OKVQA or a custom task) and check whether it yields practical gains on the order of 1–2 accuracy points.
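For the before/after comparison on a holdout set, one possible way to score hallucinations is sketched below. It is an assumption-heavy illustration: the summary does not say how the paper detected hallucinations, so this uses a crude word-matching check against per-image object annotations. The function names, the single-word object vocabulary, and the data layout are all hypothetical.

```python
from typing import Dict, Set


def hallucinated_objects(description: str, present: Set[str], vocabulary: Set[str]) -> Set[str]:
    """Objects from the vocabulary that the text mentions but the image lacks."""
    words = {w.strip(".,;:!?").lower() for w in description.split()}
    return (words & vocabulary) - present


def hallucination_rate(
    descriptions: Dict[str, str],
    annotations: Dict[str, Set[str]],
    vocabulary: Set[str],
) -> float:
    """Fraction of descriptions that mention at least one absent object."""
    if not descriptions:
        raise ValueError("no descriptions to score")
    flagged = sum(
        1
        for image_id, text in descriptions.items()
        if hallucinated_objects(text, annotations[image_id], vocabulary)
    )
    return flagged / len(descriptions)


# Toy example: two images, scored before and after correction.
vocabulary = {"dog", "cat", "cup", "table"}
annotations = {"a": {"cup", "table"}, "b": {"dog"}}
before = {"a": "A cup and a cat on the table.", "b": "A dog."}
after = {"a": "A cup on the table.", "b": "A dog."}

rate_before = hallucination_rate(before, annotations, vocabulary)
rate_after = hallucination_rate(after, annotations, vocabulary)
```

Exact word matching misses plurals, synonyms, and multi-word object names, so a check like this is only a first filter; a manual review of a sample of flagged and unflagged descriptions remains advisable.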

Agent Features

Frameworks
MiniGPT-4+, BLIP2 Q-Former
Architectures
ViT visual encoder, Q-Former (BLIP2-style), Vicuna LLM, FC projection layer

Risks & Boundaries

Limitations

VIC reduces but does not eliminate hallucinations; some false details persist.

Quality of generated data depends on the underlying VLM/LLM; weaker base models will limit output quality.

When Not To Use

When you need perfect factual grounding without manual verification.

For single-sentence dialogue tasks where VIC iterative updates are less effective.

Failure Modes

Hallucination: generating objects or facts not present in the image.

Training-data bias: common co-occurrences produce stereotyped answers.

Core Entities

Models

MiniGPT-4+, MiniGPT-4, Vicuna7B, Vicuna13B, LLaVA-7B, LLaVA-13B, InstructBLIP, PaLM-E

Metrics

LLaVA relative score to GPT-4
MMBench: LR (Logic Reasoning)
MMBench: AR (Attribute Reasoning)
MMBench: RR (Relation Reasoning)
MMBench: FP-S (Fine Perception - instance)
MMBench: FP-C (Fine Perception - cross-instance)
MMBench: CP (Coarse Perception)
Accuracy
Hallucination count (%)

Datasets

LLaVA-150K, COCO, Objects365, OKVQA, A-OKVQA, VIGC-LLaVACOCO (36,781), VIGC-LLaVAObjects365 (~1.8M), coco-extra

Benchmarks

LLaVA-Eval, MMBench, OKVQA, A-OKVQA