Automatically find and remove hallucinations in machine-generated visual instructions to make multi-modal LLMs more accurate.

Overview

Decision Snapshot: Needs Validation

The pipeline uses existing MLLMs and LLMs and shows consistent metric gains; it is practical but requires running multiple MLLM queries and LLM prompts, so expect moderate compute and integration cost.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, Yueting Zhuang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Cleaning synthetic visual instruction data cuts hallucinations and raises real-world reliability of multimodal models, reducing downstream errors and the need for runtime correction.

Who Should Care

ML Engineer, Data Scientist, Product Manager, Engineering Lead, CTO

Summary TLDR

The authors find that much machine-generated visual instruction data contains object, relation, and attribute hallucinations, which teach multimodal LLMs to make incorrect claims about images. They introduce HalluciDoctor, a data-level pipeline that (1) parses description text into 'answer chunks' (objects, relations, attributes), (2) uses an LLM to generate targeted questions, (3) queries multiple MLLM experts for image-oriented answers and scores their consistency with a BERT-based metric, and (4) removes low-consistency chunks through LLM-based rewriting. They also add a seesaw counterfactual expansion to rebalance rare object co-occurrences. On the CHAIR evaluation, models tuned on the cleaned datasets hallucinate less: sentence-level CHAIR_obj falls from 32.6% (LLaVA) to 22.2% (LLaVA+) and 19.3% (LLaVA++) in the model-agnostic setting (Table 2).

Problem Statement

Machine-generated visual instruction datasets (used to teach multimodal LLMs) contain many factual errors—objects, relations, or attributes claimed in captions that are not in images. Training on this noisy data increases hallucinations in MLLMs. The problem: how to detect and remove diverse hallucinations automatically at scale without heavy manual labeling, and how to reduce spurious co-occurrence biases that cause hallucinations.

Main Contribution

A taxonomy and extended CHAIR metric that measures object, relation, and attribute hallucinations in visual instruction data.

HalluciDoctor: an automated cross-checking pipeline that extracts textual scene-graph chunks, generates answer-based questions, queries multiple MLLMs, and flags low-consistency chunks for removal.
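To make the CHAIR-style measurement concrete, here is a minimal sketch of the two usual rates: the share of descriptions with at least one hallucinated mention (sentence-level) and the share of mentions that are hallucinated (instance-level). The function name `chair_scores`, the input layout, and the toy data are assumptions for illustration, not the authors' code. The same counting could be applied to relation or attribute chunks if ground-truth sets for them are available.

```python
from typing import Iterable, Set, Tuple


def chair_scores(samples: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (sentence_level, instance_level) hallucination rates.

    Each sample is (mentioned, ground_truth): the set of objects a description
    mentions and the set of objects actually annotated for the image.
    """
    n_desc = 0             # descriptions seen
    n_desc_halluc = 0      # descriptions with at least one hallucinated mention
    n_mentions = 0         # all mentioned objects
    n_mentions_halluc = 0  # mentioned objects absent from the ground truth
    for mentioned, ground_truth in samples:
        hallucinated = mentioned - ground_truth
        n_desc += 1
        n_desc_halluc += 1 if hallucinated else 0
        n_mentions += len(mentioned)
        n_mentions_halluc += len(hallucinated)
    sentence_level = n_desc_halluc / n_desc if n_desc else 0.0
    instance_level = n_mentions_halluc / n_mentions if n_mentions else 0.0
    return sentence_level, instance_level


# Toy data (hypothetical): two descriptions with their ground-truth object sets.
toy = [
    ({"dog", "frisbee", "tree"}, {"dog", "frisbee"}),  # "tree" is hallucinated
    ({"cat", "sofa"}, {"cat", "sofa", "lamp"}),        # nothing hallucinated
]
sent, inst = chair_scores(toy)
print(f"sentence-level: {sent:.2f}, instance-level: {inst:.2f}")
```

In practice the hard part is the extraction step that produces the `mentioned` sets and the synonym mapping to ground-truth categories, which this sketch takes as given.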

Key Findings

Machine-generated LLaVA data cause frequent hallucinations in tuned MLLMs.

Numbers: 32.6% sentence-level CHAIR_obj when fine-tuned on LLaVA (Table 2).

Practical Use: Don’t assume machine-generated instruction data is factual. If you fine-tune on it, expect ~30%+ object hallucination rates on evaluated images; clean the data first.

Evidence Ref: Table 2; Sec.3 and Sec.5.2

HalluciDoctor’s data cleaning (LLaVA+) substantially reduces hallucinations.

Numbers: Sentence-level CHAIR_obj 32.6% → 22.2% (model-agnostic, Table 2).

Practical Use: Filtering and rewriting hallucinated chunks before training can cut sentence-level object hallucinations by roughly 30–45% in relative terms on the CHAIR evaluation (32.6% → 22.2% model-agnostic; 35.0% → 19.6% for MiniGPT-4).

Evidence Ref: Table 2; Sec.5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sentence-level CHAIR_obj (model-agnostic)	LLaVA 32.6% → LLaVA+ 22.2% → LLaVA++ 19.3%	LLaVA 32.6%	−13.3 pp (LLaVA → LLaVA++)	CHAIR evaluation (500 images from MSCOCO∩VG)	Table 2 reports sentence-level CHAIR results for model-agnostic setups.	Table 2
Sentence-level CHAIR_obj (MiniGPT-4)	LLaVA 35.0% → LLaVA+ 19.6% → LLaVA++ 16.6%	LLaVA 35.0%	−18.4 pp (LLaVA → LLaVA++)	CHAIR evaluation (500 images)	Table 2 (specific MiniGPT-4 rows).	Table 2
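The deltas in the table are absolute percentage-point differences. A short sketch that recomputes them, and the corresponding relative reductions, from the reported values may help when comparing against your own runs. The helper name `reduction` is an assumption; the numbers are the ones reported in the table above.

```python
from typing import Dict, Tuple


def reduction(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute_pp = baseline - value
    relative_pct = 100.0 * absolute_pp / baseline
    return round(absolute_pp, 1), round(relative_pct, 1)


# Sentence-level CHAIR_obj values as reported: (LLaVA, LLaVA+, LLaVA++).
rows: Dict[str, Tuple[float, float, float]] = {
    "model-agnostic": (32.6, 22.2, 19.3),
    "MiniGPT-4": (35.0, 19.6, 16.6),
}

results = {}
for name, (base, plus, plusplus) in rows.items():
    results[name] = {
        "LLaVA+": reduction(base, plus),
        "LLaVA++": reduction(base, plusplus),
    }
    print(name, results[name])
```

Reporting both figures avoids ambiguity: a 13.3 pp drop from a 32.6% baseline is a relative reduction of about 41%.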

What To Try In 7 Days

Run CHAIR on your visual instruction set to quantify object/relation/attribute hallucinations.

Extract scene-graph chunks and generate answer-based questions using an LLM (ChatGPT).

Query 2–3 off-the-shelf MLLM experts, compute consistency scores with a BERT-based metric, and flag low-consistency chunks (score < 0.5). Replace or remove flagged phrases via an LLM rewrite.
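The cross-checking step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the expert answers are hard-coded toy strings rather than real MLLM outputs, `token_f1` is a simple token-overlap stand-in for the BERT-based metric, and averaging over experts is an assumed aggregation rule. Only the 0.5 threshold comes from the summary.

```python
from statistics import mean
from typing import Callable, Dict, List

THRESHOLD = 0.5  # the value the authors are reported to pick after ablation


def token_f1(candidate: str, reference: str) -> float:
    """Stand-in similarity: token-overlap F1 (NOT the BERT-based metric)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def consistency(chunk: str, expert_answers: List[str],
                similarity: Callable[[str, str], float] = token_f1) -> float:
    """Average similarity between a chunk and each expert's answer (assumed rule)."""
    if not expert_answers:
        return 0.0
    return mean(similarity(answer, chunk) for answer in expert_answers)


def flag_chunks(answers_by_chunk: Dict[str, List[str]],
                threshold: float = THRESHOLD) -> List[str]:
    """Return the chunks whose consistency score falls below the threshold."""
    return [chunk for chunk, answers in answers_by_chunk.items()
            if consistency(chunk, answers) < threshold]


# Hypothetical expert answers to questions generated for two chunks.
toy_answers = {
    "dog": ["dog", "a dog", "dog"],
    "red frisbee": ["green ball", "ball", "nothing"],
}
flagged = flag_chunks(toy_answers)
print(flagged)
```

Swapping `token_f1` for an embedding-based scorer only requires passing a different `similarity` callable; the flagged chunks would then go to the LLM rewrite step.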

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Yuqifan1117/HalluciDoctor

Data URLs

https://github.com/Yuqifan1117/HalluciDoctor

Risks & Boundaries

Limitations

Dependence on the quality and diversity of the MLLM experts used for cross-checking; poor experts can miss or mislabel hallucinations.

The consistency threshold is sensitive: setting it too high removes correct information; the authors pick 0.5 after ablation.
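One way to check threshold sensitivity on your own data is to hand-label a small validation set of chunks and count both error types at several thresholds. The sketch below assumes such a labeled set exists; the function name `sweep` and the toy scores are invented for illustration and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def sweep(scored: Sequence[Tuple[float, bool]],
          thresholds: Sequence[float]) -> List[Tuple[float, int, int]]:
    """For each threshold, count correct chunks removed and hallucinations kept.

    `scored` holds (consistency_score, is_hallucination) pairs; a chunk is
    removed when its score is strictly below the threshold.
    """
    rows = []
    for t in thresholds:
        removed_correct = sum(1 for s, h in scored if s < t and not h)
        kept_halluc = sum(1 for s, h in scored if s >= t and h)
        rows.append((t, removed_correct, kept_halluc))
    return rows


# Hypothetical hand-labeled validation chunks: (score, is_hallucination).
toy = [(0.9, False), (0.7, False), (0.55, False),
       (0.45, True), (0.6, True), (0.2, True)]

table = sweep(toy, [0.3, 0.5, 0.7])
for t, removed_correct, kept_halluc in table:
    print(f"threshold={t}: correct removed={removed_correct}, "
          f"hallucinations kept={kept_halluc}")
```

Raising the threshold trades kept hallucinations for removed correct chunks, so the right value depends on which error is more costly for your dataset.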

When Not To Use

If you have fully human-verified visual instruction data already.

If you cannot afford the compute to query several MLLMs and run LLM prompts at dataset scale.

Failure Modes

Removing accurate phrases when consistency signals are weak, reducing useful diversity.

Bias propagation from MLLM experts: consistent but wrong answers vote to keep hallucinations.

Core Entities

Models

MiniGPT-4, LLaVA, mPLUG-Owl, BLIP2, InstructBLIP

Metrics

CHAIR_obj / CHAIR_rel / CHAIR_attri, ConScore (consistency score, threshold 0.5), MME total score, Accuracy

Datasets

LLaVA-Instruction-158K, MiniGPT4-Instruction, LLaVA+ (rectified), LLaVA++ (expanded/rectified)

Benchmarks

MME, CHAIR (extended), POPE, OwlEval, MSCOCO, Visual Genome, NoCaps, GQA, AOK-VQA

