Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Cleaning synthetic visual instruction data cuts hallucinations and raises real-world reliability of multimodal models, reducing downstream errors and the need for runtime correction.
Summary TLDR
The authors identify that large amounts of machine-generated visual instruction data contain object, relation, and attribute hallucinations that teach multimodal LLMs to make incorrect claims about images. They introduce HalluciDoctor: a data-level pipeline that (1) parses description text into 'answer chunks' (objects/relations/attributes), (2) uses an LLM to generate targeted questions, (3) queries multiple MLLM experts for image-oriented answers and scores consistency with a BERT-based metric, and (4) removes low-consistency chunks with LLM-based re-writing. They also add a seesaw counterfactual expansion to rebalance rare object co-occurrences. On several benchmarks the cleaned datasets,
Problem Statement
Machine-generated visual instruction datasets (used to teach multimodal LLMs) contain many factual errors—objects, relations, or attributes claimed in captions that are not in images. Training on this noisy data increases hallucinations in MLLMs. The problem: how to detect and remove diverse hallucinations automatically at scale without heavy manual labeling, and how to reduce spurious co-occurrence biases that cause hallucinations.
Main Contribution
A taxonomy and extended CHAIR metric that measures object, relation, and attribute hallucinations in visual instruction data.
HalluciDoctor: an automated cross-checking pipeline that extracts textual scene-graph chunks, generates answer-based questions, queries multiple MLLMs, and flags low-consistency chunks for removal.
A seesaw-based counterfactual instruction expansion that rebalances long-tail object co-occurrences to reduce spurious correlations and strengthen robustness.
Public release of rectified datasets (LLaVA+ and LLaVA++) and code for reproducible cleaning and expansion.
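As a rough illustration of the CHAIR-style measurement named above, the sketch below computes an instance-level and a sentence-level hallucination rate from sets of mentioned and grounded items. It follows the standard CHAIR definitions (hallucinated mentions over all mentions, and samples with at least one hallucinated mention over all samples). The function name `chair_scores` and the toy data are hypothetical, and the same counting is assumed to apply to relation or attribute items once they are extracted; the authors' extended metric may differ in detail.

```python
from typing import Iterable, Set, Tuple


def chair_scores(samples: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (instance-level rate, sentence-level rate).

    Each sample is (mentioned, grounded): the items the text claims,
    and the items actually present in the image annotations.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    total_samples = 0
    hallucinated_samples = 0
    for mentioned, grounded in samples:
        missing = mentioned - grounded  # claimed in text, absent from image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(missing)
        total_samples += 1
        hallucinated_samples += 1 if missing else 0
    instance_rate = hallucinated_mentions / total_mentions if total_mentions else 0.0
    sentence_rate = hallucinated_samples / total_samples if total_samples else 0.0
    return instance_rate, sentence_rate


# Hypothetical toy data: "frisbee" is mentioned but not grounded.
toy = [
    ({"dog", "frisbee"}, {"dog", "grass"}),
    ({"cat"}, {"cat", "sofa"}),
]
instance_rate, sentence_rate = chair_scores(toy)
```

On the toy data, one of three mentions is hallucinated and one of two samples contains a hallucination.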
Key Findings
Machine-generated LLaVA data cause frequent hallucinations in tuned MLLMs.
HalluciDoctor’s data cleaning (LLaVA+) substantially reduces hallucinations.
Cleaning plus counterfactual expansion (LLaVA++) improves both robustness and task scores.
Results
Sentence-level CHAIR_obj (model-agnostic)
Sentence-level CHAIR_obj (MiniGPT-4)
MME total score (MiniGPT-4)
Accuracy
Who Should Care
What To Try In 7 Days
Run the extended CHAIR metric on your visual instruction set to quantify object, relation, and attribute hallucinations.
Extract scene-graph chunks and generate answer-based questions using an LLM (ChatGPT).
Query 2–3 off-the-shelf MLLM experts, compute consistency scores with a BERT-based metric, and flag low-consistency chunks (score < 0.5). Replace or remove the flagged phrases via an LLM rewrite.
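The flagging step can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the BERT-based metric is swapped for a simple token-overlap (Jaccard) similarity so the example runs without model downloads, the per-chunk score is assumed to be the mean over expert answers, and all names (`token_jaccard`, `consistency_score`, `flag_low_consistency`) are hypothetical. Only the 0.5 threshold comes from the summary above.

```python
from statistics import mean
from typing import Callable, Dict, List


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity (NOT the BERT-based metric): token-set Jaccard."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_score(
    chunk: str,
    answers: List[str],
    sim: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Assumed aggregation: mean similarity between the chunk and each expert answer."""
    return mean(sim(chunk, ans) for ans in answers) if answers else 0.0


def flag_low_consistency(
    expert_answers: Dict[str, List[str]],
    threshold: float = 0.5,
    sim: Callable[[str, str], float] = token_jaccard,
) -> List[str]:
    """Return the chunks whose consistency score falls below the threshold."""
    return [
        chunk
        for chunk, answers in expert_answers.items()
        if consistency_score(chunk, answers, sim) < threshold
    ]


# Hypothetical expert answers for two answer chunks from one description.
toy_answers = {
    "red umbrella": ["red umbrella", "a red umbrella", "umbrella"],
    "wooden bench": ["no bench", "grass", "nothing"],
}
flagged = flag_low_consistency(toy_answers)
```

Here only "wooden bench" is flagged, since the experts' answers barely overlap with it. In practice the `sim` argument would be replaced by an embedding-based scorer, and flagged chunks would be passed to an LLM rewrite rather than deleted by string matching.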
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Dependence on the quality and diversity of the MLLM experts used for cross-checking; poor experts can miss or mislabel hallucinations.
- The consistency threshold is sensitive: setting it too high removes correct information; the authors pick 0.5 after ablation.
- Counterfactual expansion relies on image synthesis/placement steps that can introduce visual artifacts or unrealistic training signals.
- Relation and attribute pseudo-labels use foundation models (GroundingDINO, BLIP) and may be noisy.
When Not To Use
- If you have fully human-verified visual instruction data already.
- If you cannot afford the compute to query several MLLMs and run LLM prompts at dataset scale.
- If you are restricted from modifying training data and can only change inference-time policies.
Failure Modes
- Removing accurate phrases when consistency signals are weak, reducing useful diversity.
- Bias propagation from MLLM experts: when experts agree on the same wrong answer, their votes keep the hallucination in the data.
- Counterfactual examples may create distributional artifacts that hurt generalization if synthesis is low-quality.
- Threshold miscalibration may either miss hallucinations or over-delete content.
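One way to check for threshold miscalibration on your own data is to sweep candidate thresholds over a small hand-labeled sample and count both error types. The sketch below assumes such a sample exists as (consistency score, is-hallucination) pairs; the function name `sweep_thresholds` and the toy values are hypothetical and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    scored: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[Tuple[float, int, int]]:
    """For each threshold return (threshold, over_deleted, missed).

    over_deleted: correct chunks that would be removed (score < threshold).
    missed: hallucinated chunks that would be kept (score >= threshold).
    """
    rows = []
    for t in thresholds:
        over_deleted = sum(1 for s, is_h in scored if s < t and not is_h)
        missed = sum(1 for s, is_h in scored if s >= t and is_h)
        rows.append((t, over_deleted, missed))
    return rows


# Hypothetical hand-labeled sample: (consistency score, is hallucination).
toy_scored = [
    (0.9, False), (0.7, False), (0.45, False),
    (0.6, True), (0.3, True), (0.1, True),
]
rows = sweep_thresholds(toy_scored, [0.2, 0.5, 0.8])
```

On the toy sample, raising the threshold trades missed hallucinations for over-deleted correct chunks, which is the trade-off the two failure modes above describe.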
Core Entities
Models
- MiniGPT-4
- LLaVA
- mPLUG-Owl
- BLIP2
- InstructBLIP
Metrics
- CHAIR_obj/CHAIR_rel/CHAIR_attri
- ConScore (consistency score, threshold 0.5)
- MME total score
- Accuracy
Datasets
- LLaVA-Instruction-158K
- MiniGPT4-Instruction
- LLaVA+ (rectified)
- LLaVA++ (expanded/rectified)
Benchmarks
- MME
- CHAIR (extended)
- POPE
- OwlEval
- MSCOCO
- Visual Genome
- NoCaps
- GQA
- A-OKVQA

