Overview
The pipeline uses existing MLLMs and LLMs and shows consistent metric gains; it is practical but requires running multiple MLLM queries and LLM prompts, so expect moderate compute and integration cost.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Cleaning synthetic visual instruction data cuts hallucinations and raises real-world reliability of multimodal models, reducing downstream errors and the need for runtime correction.
Summary TLDR
The authors identify that large amounts of machine-generated visual instruction data contain object, relation, and attribute hallucinations that teach multimodal LLMs to make incorrect claims about images. They introduce HalluciDoctor: a data-level pipeline that (1) parses description text into 'answer chunks' (objects/relations/attributes), (2) uses an LLM to generate targeted questions, (3) queries multiple MLLM experts for image-oriented answers and scores consistency with a BERT-based metric, and (4) removes low-consistency chunks with LLM-based re-writing. They also add a seesaw counterfactual expansion to rebalance rare object co-occurrences. On several benchmarks the cleaned datasets,
Problem Statement
Machine-generated visual instruction datasets (used to teach multimodal LLMs) contain many factual errors—objects, relations, or attributes claimed in captions that are not in images. Training on this noisy data increases hallucinations in MLLMs. The problem: how to detect and remove diverse hallucinations automatically at scale without heavy manual labeling, and how to reduce spurious co-occurrence biases that cause hallucinations.
Main Contribution
A taxonomy and extended CHAIR metric that measures object, relation, and attribute hallucinations in visual instruction data.
HalluciDoctor: an automated cross-checking pipeline that extracts textual scene-graph chunks, generates answer-based questions, queries multiple MLLMs, and flags low-consistency chunks for removal.
Key Findings
Machine-generated LLaVA data cause frequent hallucinations in tuned MLLMs.
HalluciDoctor’s data cleaning (LLaVA+) substantially reduces hallucinations.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Sentence-level CHAIR_obj (model-agnostic) | LLaVA 32.6% → LLaVA+ 22.2% → LLaVA++ 19.3% | LLaVA 32.6% | −13.3 pp (LLaVA → LLaVA++) | CHAIR evaluation (500 images from MSCOCO∩VG) | Table 2 reports sentence-level CHAIR results for model-agnostic setups. | Table 2 |
| Sentence-level CHAIR_obj (MiniGPT-4) | LLaVA 35.0% → LLaVA+ 19.6% → LLaVA++ 16.6% | LLaVA 35.0% | −18.4 pp (LLaVA → LLaVA++) | CHAIR evaluation (500 images) | Table 2 (specific MiniGPT-4 rows). | Table 2 |
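To make the table concrete, here is a minimal sketch of a sentence-level CHAIR computation for objects and of the percentage-point deltas reported above. It assumes sentence-level CHAIR means the fraction of descriptions that mention at least one object not present in the image. The function names and toy object sets are hypothetical and are not taken from the authors' code.

```python
from typing import Iterable, Set


def sentence_chair(mentioned: Iterable[Set[str]], present: Iterable[Set[str]]) -> float:
    """Fraction of descriptions that mention at least one object absent from the image."""
    pairs = list(zip(mentioned, present))
    if not pairs:
        return 0.0
    # A description counts as hallucinated if any mentioned object is not in the ground truth.
    hallucinated = sum(1 for said, truth in pairs if said - truth)
    return hallucinated / len(pairs)


def delta_pp(baseline_pct: float, new_pct: float) -> float:
    """Change in percentage points, rounded to one decimal place."""
    return round(new_pct - baseline_pct, 1)


# Toy data (hypothetical): objects mentioned per description vs. objects annotated per image.
mentioned = [{"dog", "frisbee"}, {"cat", "sofa"}, {"car"}, {"person", "umbrella"}]
present = [{"dog", "grass"}, {"cat", "sofa", "lamp"}, {"car", "road"}, {"person"}]

score = sentence_chair(mentioned, present)  # descriptions 1 and 4 contain an absent object

# Deltas recomputed from the Table 2 figures quoted above.
model_agnostic_delta = delta_pp(32.6, 19.3)
minigpt4_delta = delta_pp(35.0, 16.6)
```

The same function extends to relations or attributes if the sets hold relation or attribute chunks instead of object names, although how the extended metric matches chunks is not specified here.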
What To Try In 7 Days
Run CHAIR on your visual instruction set to quantify object/relation/attribute hallucinations.
Extract scene-graph chunks and generate answer-based questions using an LLM (ChatGPT).
Query 2–3 off-the-shelf MLLM experts, compute consistency scores with a BERT-based metric, and flag chunks whose score falls below 0.5. Replace or remove the flagged phrases via an LLM rewrite.
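The cross-checking step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the similarity function is a simple token-overlap stand-in for the BERT-based metric, averaging over experts is an assumed aggregation rule, and all names and expert answers are hypothetical.

```python
from typing import Callable, Dict, List


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: token-set Jaccard overlap (NOT the BERT-based metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(
    chunk: str,
    expert_answers: List[str],
    similarity: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Mean similarity between a chunk and each expert's answer (assumed aggregation)."""
    if not expert_answers:
        return 0.0
    return sum(similarity(chunk, answer) for answer in expert_answers) / len(expert_answers)


def flag_chunks(
    answers_by_chunk: Dict[str, List[str]],
    threshold: float = 0.5,
    similarity: Callable[[str, str], float] = token_jaccard,
) -> List[str]:
    """Return chunks whose consistency score falls below the threshold."""
    return [
        chunk
        for chunk, answers in answers_by_chunk.items()
        if consistency_score(chunk, answers, similarity) < threshold
    ]


# Hypothetical answers from three MLLM experts to the question generated for each chunk.
answers_by_chunk = {
    "red umbrella": ["red umbrella", "a red umbrella", "red umbrella"],
    "frisbee": ["ball", "nothing", "a dog"],
}
flagged = flag_chunks(answers_by_chunk)
```

Because `similarity` is a parameter, an embedding-based scorer can be passed in without changing the flagging logic.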
Risks & Boundaries
Limitations
Dependence on the quality and diversity of the MLLM experts used for cross-checking; poor experts can miss or mislabel hallucinations.
The consistency threshold is sensitive: setting it too high removes correct information. The authors pick 0.5 after ablation.
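A quick way to see the threshold sensitivity on your own data is to sweep a few values and inspect which chunks would be removed at each. The sketch below uses made-up consistency scores and a hypothetical helper name; only the 0.5 value comes from the summary above.

```python
from typing import Dict, List


def removed_at(scores: Dict[str, float], threshold: float) -> List[str]:
    """Chunks that would be removed at a given threshold, sorted alphabetically."""
    return sorted(chunk for chunk, score in scores.items() if score < threshold)


# Hypothetical per-chunk consistency scores for one description.
scores = {"dog": 0.92, "on the grass": 0.61, "wooden bench": 0.48, "frisbee": 0.12}

# Raising the threshold removes more chunks, including ones that may be correct.
sweep = {t: removed_at(scores, t) for t in (0.3, 0.5, 0.7)}
```

Manually checking a sample of the chunks that change status between adjacent thresholds gives a rough estimate of how much correct information a stricter setting would cost.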
When Not To Use
If you have fully human-verified visual instruction data already.
If you cannot afford the compute to query several MLLMs and run LLM prompts at dataset scale.
Failure Modes
Removing accurate phrases when consistency signals are weak, reducing useful diversity.
Bias propagation from MLLM experts: when the experts agree on a wrong answer, the hallucinated chunk passes the consistency check and is kept.

