Use vision-language models to auto-generate and iteratively correct multimodal instruction data

Overview

Decision Snapshot: Ready For Pilot

The method shows practical, reproducible gains when used to augment instruction tuning with generated multimodal QA pairs; results are supported by multiple benchmarks, but hallucination and domain bias still require monitoring.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

VIGC can cheaply scale multimodal instruction data and improve model performance on perception and knowledge VQA tasks, reducing the need for costly human annotation while trimming hallucinations through an automated correction loop.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO, Product Manager

Summary TLDR

VIGC is a two-part pipeline that uses existing vision-language models to generate large, diverse image-question-answer instruction datasets (VIG) and then reduce hallucinations by iteratively reconditioning visual features on the question and partial answers (VIC). The authors release ~36.8K COCO-based instances and ~1.8M Objects365 instances, show consistent gains when fine-tuning multimodal models (LLaVA/MiniGPT-4+/InstructBLIP) on VIGC data across LLaVA-eval, MMBench, OKVQA and A-OKVQA, and report a large drop in detected hallucinations after VIC correction. The pipeline is practical to run (10 hours on 8 A100s for training Q-Former/projection) and is available with code and datasets.

Problem Statement

High-quality vision-language instruction data is scarce. Language-only generators (e.g., GPT-4) need manual pre-annotations and lose image detail. Existing multimodal models can generate data but often hallucinate or produce low-quality answers. The paper asks: can we automatically produce diverse, high-quality visual instruction data from multimodal models and reduce hallucinations so the data is useful for instruction tuning?

Main Contribution

Introduce VIGC: a two-stage self-instruct pipeline—Visual Instruction Generation (VIG) to synthesize image-question-answer pairs, and Visual Instruction Correction (VIC) to iteratively fix hallucinations using an Iterative Q-Former (IQF) update.

Release generated datasets: 36,781 COCO-based VIGC-LLaVACOCO and about 1.8M VIGC-LLaVAObjects365 instances for multimodal instruction tuning.
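The VIC stage can be pictured as a loop that re-extracts visual features each time the partial answer grows. The sketch below is illustrative only and is not the authors' implementation: `extract_features` and `generate` are hypothetical callables standing in for the Q-Former and the language model, and keeping exactly one new sentence per iteration is an assumption that this summary does not confirm.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Return the text up to and including the first sentence-ending mark."""
    for i, ch in enumerate(text):
        if ch in ".!?":
            return text[: i + 1].strip()
    return text.strip()


def iterative_correction(
    image: object,
    question: str,
    extract_features: Callable[[object, str], object],  # hypothetical Q-Former stand-in
    generate: Callable[[object, str], str],             # hypothetical LLM stand-in
    max_steps: int = 8,
) -> str:
    """Build an answer sentence by sentence, re-conditioning features each step."""
    kept: List[str] = []
    for _ in range(max_steps):
        # Condition the visual features on the question plus the answer so far.
        context = " ".join([question] + kept)
        features = extract_features(image, context)
        continuation = generate(features, context).strip()
        if not continuation:
            break  # the generator has nothing more to add
        # Assumption: keep only the first new sentence, then re-extract features.
        kept.append(first_sentence(continuation))
    return " ".join(kept)
```

Passing the two model components in as callables keeps the loop independent of any particular framework, so it can be exercised with simple stubs before real models are wired in.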

Key Findings

Fine-tuning with VIGC COCO data improved LLaVA-7B overall score.

Numbers: Overall 81.0 -> 85.8 (↑4.8)

Practical Use: Add VIGC COCO-style synthetic QA pairs to instruction data; the reported gain is about 4.8 points on the LLaVA-Eval overall score, which is measured relative to GPT-4.

Evidence Ref: Table 3 (LLaVA-Eval)

VIC correction sharply reduced detected hallucinations in generated descriptions.

Numbers: Hallucination count 66% -> 10% (on 100-image test)

Practical Use: Apply VIC iterative correction to cut hallucinations in generated data; this yields cleaner training data and better fine-tuned behavior.

Evidence Ref: Table 9 / Appendix B

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLaVA-Eval overall (LLaVA-7B)	85.8	81.0	↑4.8	LLaVA-Eval (all)	Table 3: LLaVA-7B w/ coco overall 85.8 vs baseline 81.0	Table 3
Hallucination rate in generated descriptions	10%	66%	↓56pp	100-image test set (VIG vs VIC)	Table 9: hallucination count reduced from 66% (VIG) to 10% (VIC)	Table 9 / Appendix B
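The deltas in the table can be rechecked with a few lines of arithmetic. The helper names below are illustrative; the inputs are the values reported in the table. The relative changes are derived here and do not appear in the source.

```python
def absolute_delta(new: float, baseline: float) -> float:
    """Difference in points (or percentage points), rounded to one decimal."""
    return round(new - baseline, 1)


def relative_change(new: float, baseline: float) -> float:
    """Change expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


llava_gain = absolute_delta(85.8, 81.0)         # points on LLaVA-Eval overall
llava_rel = relative_change(85.8, 81.0)         # fraction of the baseline score
halluc_drop_pp = absolute_delta(10.0, 66.0)     # percentage points
halluc_rel = relative_change(10.0, 66.0)        # fraction of the baseline rate

print(f"LLaVA-Eval: +{llava_gain} points ({llava_rel:.1%} relative)")
print(f"Hallucination: {halluc_drop_pp} pp ({halluc_rel:.1%} relative)")
```

The 56-point drop in hallucination rate is an absolute difference in percentage points; relative to the 66% baseline it is a reduction of roughly 85%.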

What To Try In 7 Days

Run VIGC on a small, domain-specific image subset to generate ~10k synthetic QA pairs and fine-tune a 7B multimodal model.

Compare model answers before and after applying VIC to measure hallucination reduction on a 100-image holdout.

Evaluate the tuned model on one domain benchmark (e.g., OKVQA or a custom task) and check for a practical gain of 1–2 accuracy points.
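For the before/after comparison on a holdout, one simple proxy is the share of descriptions that mention an object absent from the image's annotations. The sketch below rests on assumptions: this summary does not say how the paper detected hallucinations, the function names and data layout are hypothetical, and the matching covers single-word object names only.

```python
from typing import Dict, List, Set


def hallucinated_objects(text: str, present: Set[str], vocabulary: List[str]) -> Set[str]:
    """Vocabulary objects mentioned in the text but absent from the annotations."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return {obj for obj in vocabulary if obj in words and obj not in present}


def hallucination_rate(
    descriptions: Dict[str, str],      # image id -> generated description
    annotations: Dict[str, Set[str]],  # image id -> objects actually in the image
    vocabulary: List[str],             # object names to check for
) -> float:
    """Fraction of descriptions containing at least one hallucinated object."""
    if not descriptions:
        return 0.0
    flagged = sum(
        1
        for image_id, text in descriptions.items()
        if hallucinated_objects(text, annotations[image_id], vocabulary)
    )
    return flagged / len(descriptions)
```

Running `hallucination_rate` once on the uncorrected outputs and once on the VIC-corrected outputs for the same 100 images gives two comparable numbers. A manual spot check is still advisable, since word matching misses synonyms and multi-word objects.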

Agent Features

Frameworks

MiniGPT-4+, BLIP2 Q-Former

Architectures

ViT visual encoder, Q-Former (BLIP2-style), Vicuna LLM, FC projection layer

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/opendatalab/VIGC https://opendatalab.github.io/VIGC

Data URLs

https://opendatalab.github.io/VIGC https://opendatalab.com/OpenDataLab/VIGC-InstData

Risks & Boundaries

Limitations

VIC reduces but does not eliminate hallucinations; some false details persist.

Quality of generated data depends on the underlying VLM/LLM; weaker base models will limit output quality.

When Not To Use

When you need perfect factual grounding without manual verification.

For single-sentence dialogue tasks where VIC iterative updates are less effective.

Failure Modes

Hallucination: generating objects or facts not present in the image.

Training-data bias: common co-occurrences produce stereotyped answers.

Core Entities

Models

MiniGPT-4+, MiniGPT-4, Vicuna7B, Vicuna13B, LLaVA-7B, LLaVA-13B, InstructBLIP, PaLM-E

Metrics

LLaVA relative score to GPT-4, MMBench: LR (Logic Reasoning), MMBench: AR (Attribute Reasoning), MMBench: RR (Relation Reasoning), MMBench: FP-S (Fine Perception - instance), MMBench: FP-C (Fine Perception - cross-instance), MMBench: CP (Coarse Perception), Accuracy, Hallucination count (%)

Datasets

LLaVA-150K, COCO, Objects365, OKVQA, A-OKVQA, VIGC-LLaVACOCO (36,781), VIGC-LLaVAObjects365 (~1.8M), coco-extra

Benchmarks

LLaVA-Eval, MMBench, OKVQA, A-OKVQA
