Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
VIGC can cheaply scale multimodal instruction data and improve model performance on perception and knowledge VQA tasks, reducing the need for costly human annotation while trimming hallucinations through an automated correction loop.
Summary TLDR
VIGC is a two-part pipeline built on existing vision-language models. The first part, VIG, generates large, diverse image-question-answer instruction datasets. The second part, VIC, reduces hallucinations by iteratively reconditioning visual features on the question and partial answers. The authors release ~36.8K COCO-based instances and ~1.8M Objects365 instances. Fine-tuning multimodal models (LLaVA, MiniGPT-4+, InstructBLIP) on VIGC data gives consistent gains across LLaVA-eval, MMBench, OKVQA and A-OKVQA, and the authors report a large drop in detected hallucinations after VIC correction. Training the Q-Former and projection layer takes about 10 hours on 8 A100s, and code and datasets are available.
Problem Statement
High-quality vision-language instruction data is scarce. Language-only generators (e.g., GPT-4) need manual pre-annotations and lose image detail. Existing multimodal models can generate data but often hallucinate or produce low-quality answers. The paper asks: can we automatically produce diverse, high-quality visual instruction data from multimodal models and reduce hallucinations so the data is useful for instruction tuning?
Main Contribution
Introduce VIGC, a two-stage self-instruct pipeline: Visual Instruction Generation (VIG) synthesizes image-question-answer pairs, and Visual Instruction Correction (VIC) iteratively fixes hallucinations using an Iterative Q-Former (IQF) update.
Release generated datasets: 36,781 COCO-based VIGC-LLaVACOCO and about 1.8M VIGC-LLaVAObjects365 instances for multimodal instruction tuning.
Empirically show consistent downstream improvements: models fine-tuned with VIGC data improve on LLaVA-eval, MMBench, OKVQA and A-OKVQA; VIC reduces hallucinations markedly.
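The two stages above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical callable standing in for a vision-language model, the newline-separated question/answer format is an assumption, and re-prompting with the partial answer only approximates the feature-level reconditioning that the Iterative Q-Former performs inside the model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a vision-language model call:
# (image, text_prompt) -> generated text. Not a real library API.
GenerateFn = Callable[[object, str], str]


def first_sentence(text: str) -> str:
    """Return the text up to and including its first period (or all of it)."""
    text = text.strip()
    end = text.find(".")
    return text if end == -1 else text[: end + 1]


def vig(image: object, instruction: str, generate: GenerateFn) -> Tuple[str, str]:
    """Generation step: ask the model for a question and a draft answer.

    Assumes the model returns the question, a newline, then the answer.
    """
    question, _, answer = generate(image, instruction).partition("\n")
    return question.strip(), answer.strip()


def vic(image: object, question: str, generate: GenerateFn, max_steps: int = 8) -> str:
    """Correction step: rebuild the answer one sentence at a time."""
    kept: List[str] = []
    for _ in range(max_steps):
        # Condition on the question plus every sentence accepted so far.
        prompt = " ".join([question] + kept)
        sentence = first_sentence(generate(image, prompt))
        if not sentence:  # an empty continuation ends the loop
            break
        kept.append(sentence)
    return " ".join(kept)
```

Keeping only the first sentence of each continuation is the point of the sketch: later sentences, where drift from the image tends to accumulate, are discarded and regenerated with the accepted text as context.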
Key Findings
Fine-tuning with VIGC COCO data improved LLaVA-7B's overall LLaVA-Eval score.
VIC correction sharply reduced detected hallucinations in generated descriptions.
Adding VIGC COCO data improved MiniGPT-4+'s overall MMBench score.
Adding generated data gave small but consistent gains on the knowledge-heavy VQA benchmarks OKVQA and A-OKVQA.
Results
LLaVA-Eval overall (LLaVA-7B)
Hallucination rate in generated descriptions
MMBench overall (MiniGPT-4+)
OKVQA (InstructBLIP)
A-OKVQA (InstructBLIP)
Who Should Care
What To Try In 7 Days
Run VIGC on a small, domain-specific image subset to generate ~10k synthetic QA pairs and fine-tune a 7B multimodal model.
Compare model answers before and after applying VIC to measure hallucination reduction on a 100-image holdout.
Evaluate the tuned model on one domain benchmark (e.g., OKVQA or a custom task) and check whether it yields practical gains on the order of 1–2 accuracy points.
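For the before/after comparison on a small holdout, one simple proxy is the share of descriptions that mention at least one object absent from the image's annotations. The sketch below assumes you already have, per image, a set of mentioned objects and a set of annotated objects; the function name and inputs are hypothetical, and this is not necessarily how the paper detects hallucinations.

```python
from typing import Sequence, Set


def hallucination_rate(mentioned: Sequence[Set[str]], present: Sequence[Set[str]]) -> float:
    """Fraction of descriptions that mention an object not annotated in the image."""
    if len(mentioned) != len(present):
        raise ValueError("mentioned and present must have the same length")
    if not mentioned:
        return 0.0
    # A description is flagged if any mentioned object is missing from the annotations.
    flagged = sum(1 for m, p in zip(mentioned, present) if m - p)
    return flagged / len(mentioned)


# Toy holdout of two images (illustrative data only).
annotations = [{"dog", "grass"}, {"cat"}]
before_vic = [{"dog", "frisbee"}, {"cat"}]
after_vic = [{"dog"}, {"cat"}]

print(hallucination_rate(before_vic, annotations))
print(hallucination_rate(after_vic, annotations))
```

Extracting the mentioned objects from free text is the hard part in practice and is left out here; incomplete annotations will also inflate the measured rate.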
Agent Features
Frameworks
- MiniGPT-4+
- BLIP2 Q-Former
Architectures
- ViT visual encoder
- Q-Former (BLIP2-style)
- Vicuna LLM
- FC projection layer
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- VIC reduces but does not eliminate hallucinations; some false details persist.
- Quality of generated data depends on the underlying VLM/LLM; weaker base models will limit output quality.
- Detailed description tasks still show information decay during long sequence generation.
When Not To Use
- When you need perfect factual grounding without manual verification.
- For single-sentence dialogue tasks where VIC iterative updates are less effective.
- If your base vision or language models are very weak, generated data may be low quality.
Failure Modes
- Hallucination: generating objects or facts not present in the image.
- Training-data bias: common co-occurrences produce stereotyped answers.
- Information decay: longer generated descriptions drift from the image.
Core Entities
Models
- MiniGPT-4+
- MiniGPT-4
- Vicuna7B
- Vicuna13B
- LLaVA-7B
- LLaVA-13B
- InstructBLIP
- PaLM-E
Metrics
- LLaVA relative score to GPT-4
- MMBench: LR (Logic Reasoning)
- MMBench: AR (Attribute Reasoning)
- MMBench: RR (Relation Reasoning)
- MMBench: FP-S (Fine Perception - instance)
- MMBench: FP-C (Fine Perception - cross-instance)
- MMBench: CP (Coarse Perception)
- Accuracy
- Hallucination rate (%)
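The relative score in the first metric is commonly computed from judge ratings: a GPT-4 judge rates both the candidate answer and a GPT-4 reference answer per question, and the candidate's total is expressed as a percentage of the reference's total. The sketch below assumes that convention and a numeric rating scale; the function name is hypothetical, and the exact aggregation used in the paper may differ.

```python
from typing import Sequence


def relative_score(candidate: Sequence[float], reference: Sequence[float]) -> float:
    """Candidate judge ratings as a percentage of reference judge ratings."""
    if len(candidate) != len(reference) or not candidate:
        raise ValueError("need equally sized, non-empty rating lists")
    total_reference = sum(reference)
    if total_reference == 0:
        raise ValueError("reference ratings sum to zero")
    return 100.0 * sum(candidate) / total_reference


# Illustrative ratings only, not results from the paper.
print(relative_score([8, 6, 7], [9, 8, 8]))
```

A score above 100 would mean the judge rated the candidate higher than the reference on aggregate.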
Datasets
- LLaVA-150K
- COCO
- Objects365
- OKVQA
- A-OKVQA
- VIGC-LLaVACOCO (36,781)
- VIGC-LLaVAObjects365 (~1.8M)
- coco-extra
Benchmarks
- LLaVA-Eval
- MMBench
- OKVQA
- A-OKVQA

