Overview
The method shows practical, reproducible gains when used to augment instruction tuning with generated multimodal QA pairs; results are supported by multiple benchmarks, but hallucination and domain bias still require monitoring.
Citations: 3
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
VIGC can cheaply scale multimodal instruction data and improve model performance on perception and knowledge VQA tasks, reducing the need for costly human annotation while trimming hallucinations through an automated correction loop.
Summary TLDR
VIGC is a two-part pipeline built on existing vision-language models. The first part, VIG, generates large, diverse image-question-answer instruction datasets. The second part, VIC, reduces hallucinations by iteratively reconditioning visual features on the question and the partial answer. The authors release ~36.8K COCO-based instances and ~1.8M Objects365 instances. Fine-tuning multimodal models (LLaVA/MiniGPT-4+/InstructBLIP) on VIGC data yields consistent gains across LLaVA-eval, MMBench, OKVQA and A-OKVQA, and the authors report a large drop in detected hallucinations after VIC correction. The pipeline is practical to run (10 hours on 8 A100s for training Q-Former/projection), and the code and datasets are available.
Problem Statement
High-quality vision-language instruction data is scarce. Language-only generators (e.g., GPT-4) need manual pre-annotations and lose image detail. Existing multimodal models can generate data but often hallucinate or produce low-quality answers. The paper asks: can we automatically produce diverse, high-quality visual instruction data from multimodal models and reduce hallucinations so the data is useful for instruction tuning?
Main Contribution
Introduce VIGC: a two-stage self-instruct pipeline—Visual Instruction Generation (VIG) to synthesize image-question-answer pairs, and Visual Instruction Correction (VIC) to iteratively fix hallucinations using an Iterative Q-Former (IQF) update.
Release generated datasets: 36,781 COCO-based VIGC-LLaVA-COCO instances and about 1.8M VIGC-LLaVA-Objects365 instances for multimodal instruction tuning.
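The correction stage described above can be pictured as a loop that rebuilds the answer sentence by sentence, re-encoding the image against the question plus the answer so far at each step. The following is a minimal sketch of that idea only, not the authors' implementation: `iterative_correction`, `encode`, and `next_sentence` are hypothetical names, the two callables stand in for a real vision encoder and language model, and the empty-string stop signal is an assumption of this sketch.

```python
from typing import Callable, List


def iterative_correction(
    image: object,
    question: str,
    encode: Callable[[object, str], object],
    next_sentence: Callable[[object, str, str], str],
    max_sentences: int = 8,
) -> str:
    """Build an answer one sentence at a time, re-encoding the image each step.

    `encode` and `next_sentence` are hypothetical stand-ins for a
    question-conditioned visual encoder and a language model.
    """
    sentences: List[str] = []
    for _ in range(max_sentences):
        partial = " ".join(sentences)
        # Recondition the visual features on the question plus the answer so far.
        context = question if not partial else f"{question} {partial}"
        features = encode(image, context)
        sentence = next_sentence(features, question, partial)
        if not sentence:  # assumption: an empty string signals end of answer
            break
        sentences.append(sentence)
    return " ".join(sentences)
```

Passing the model components in as callables keeps the loop independent of any particular framework, so the same control flow can be exercised with simple stubs before wiring in a real model.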
Key Findings
Fine-tuning with VIGC COCO data improved LLaVA-7B overall score.
VIC correction sharply reduced detected hallucinations in generated descriptions.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LLaVA-Eval overall (LLaVA-7B) | 85.8 | 81.0 | ↑4.8 | LLaVA-Eval (all) | Table 3: LLaVA-7B w/ coco overall 85.8 vs baseline 81.0 | Table 3 |
| Hallucination rate in generated descriptions | 10% | 66% | ↓56pp | 100-image test set (VIG vs VIC) | Table 9: hallucination rate reduced from 66% (VIG) to 10% (VIC) | Table 9 / Appendix B |
What To Try In 7 Days
Run VIGC on a small, domain-specific image subset to generate ~10k synthetic QA pairs and fine-tune a 7B multimodal model.
Compare model answers before and after applying VIC to measure hallucination reduction on a 100-image holdout.
Evaluate the tuned model on one domain benchmark (e.g., OKVQA or a custom task) and check whether accuracy improves by at least 1–2 points.
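For the before/after comparison on a holdout set, the bookkeeping can be as simple as one boolean per image marking whether a reviewer found a hallucination. The sketch below assumes that labeling scheme; `hallucination_rate` and `delta_pp` are illustrative names, and the sample flags are made-up data chosen to mirror the 66% and 10% rates reported in the results table.

```python
from typing import Sequence


def hallucination_rate(flags: Sequence[bool]) -> float:
    """Percentage of examples flagged as containing a hallucination."""
    if not flags:
        raise ValueError("need at least one labeled example")
    return 100.0 * sum(flags) / len(flags)


def delta_pp(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Change in hallucination rate, in percentage points (negative = fewer)."""
    return hallucination_rate(after) - hallucination_rate(before)


# Illustrative labels for a 100-image holdout, not real annotations.
before_flags = [True] * 66 + [False] * 34
after_flags = [True] * 10 + [False] * 90
print(f"before: {hallucination_rate(before_flags):.0f}%")
print(f"after:  {hallucination_rate(after_flags):.0f}%")
print(f"delta:  {delta_pp(before_flags, after_flags):+.0f}pp")
```

Reporting the change in percentage points rather than as a relative percentage keeps the figure directly comparable to the delta column in the results table.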
Risks & Boundaries
Limitations
VIC reduces but does not eliminate hallucinations; some false details persist.
Quality of generated data depends on the underlying VLM/LLM; weaker base models will limit output quality.
When Not To Use
When you need perfect factual grounding without manual verification.
For single-sentence dialogue tasks where VIC iterative updates are less effective.
Failure Modes
Hallucination: generating objects or facts not present in the image.
Training-data bias: common co-occurrences produce stereotyped answers.

