Automatically find and remove hallucinations in machine-generated visual instructions to make multi-modal LLMs more accurate.

November 22, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The pipeline uses existing MLLMs and LLMs and shows consistent metric gains; it is practical but requires running multiple MLLM queries and LLM prompts, so expect moderate compute and integration cost.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, Yueting Zhuang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Cleaning synthetic visual instruction data cuts hallucinations and raises real-world reliability of multimodal models, reducing downstream errors and the need for runtime correction.

Who Should Care

Summary TLDR

The authors identify that large amounts of machine-generated visual instruction data contain object, relation, and attribute hallucinations that teach multimodal LLMs to make incorrect claims about images. They introduce HalluciDoctor: a data-level pipeline that (1) parses description text into 'answer chunks' (objects/relations/attributes), (2) uses an LLM to generate targeted questions, (3) queries multiple MLLM experts for image-oriented answers and scores consistency with a BERT-based metric, and (4) removes low-consistency chunks with LLM-based re-writing. They also add a seesaw counterfactual expansion to rebalance rare object co-occurrences. On several benchmarks the cleaned datasets,

Problem Statement

Machine-generated visual instruction datasets (used to teach multimodal LLMs) contain many factual errors—objects, relations, or attributes claimed in captions that are not in images. Training on this noisy data increases hallucinations in MLLMs. The problem: how to detect and remove diverse hallucinations automatically at scale without heavy manual labeling, and how to reduce spurious co-occurrence biases that cause hallucinations.
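To make the co-occurrence concern concrete, here is a minimal sketch that counts how often pairs of objects are mentioned together across instruction samples. The object lists are made-up toy data and `cooccurrence_counts` is a hypothetical helper, not part of the authors' code; it only illustrates how a heavily skewed pair (here "chair" and "table") can be spotted before training.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(object_lists):
    """Count unordered object pairs that are mentioned in the same sample."""
    counts = Counter()
    for objects in object_lists:
        # Deduplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(objects)), 2):
            counts[pair] += 1
    return counts


# Toy data (assumption): objects mentioned in each instruction sample.
samples = [
    ["table", "chair", "cup"],
    ["table", "chair"],
    ["table", "laptop"],
    ["chair", "table", "person"],
]

counts = cooccurrence_counts(samples)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs that dominate the counts are candidates for the kind of spurious association the problem statement describes; rare pairs are the ones a rebalancing step would need to add.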

Main Contribution

A taxonomy and extended CHAIR metric that measures object, relation, and attribute hallucinations in visual instruction data.

HalluciDoctor: an automated cross-checking pipeline that extracts textual scene-graph chunks, generates answer-based questions, queries multiple MLLMs, and flags low-consistency chunks for removal.
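The first two pipeline stages (chunk extraction and answer-based question generation) can be sketched as a small data structure plus a prompt builder. This is an illustrative sketch only: `AnswerChunk`, `build_question_prompt`, and the prompt wording are assumptions made for this example, and the chunks are hand-written here rather than parsed automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerChunk:
    kind: str  # "object", "relation", or "attribute"
    text: str  # phrase from the description to verify against the image


def build_question_prompt(description: str, chunk: AnswerChunk) -> str:
    """Build an LLM prompt asking for a question whose answer is the chunk.

    The wording is a hypothetical example, not the authors' prompt.
    """
    return (
        "Given the image description below, write one question about the image "
        f'whose correct answer is the {chunk.kind} "{chunk.text}".\n'
        f"Description: {description}\n"
        "Question:"
    )


description = "A man in a red jacket is holding a frisbee next to a dog."

# Hand-written chunks (assumption); a real pipeline would parse these from the text.
chunks = [
    AnswerChunk("object", "frisbee"),
    AnswerChunk("attribute", "red"),
    AnswerChunk("relation", "next to"),
]

prompts = [build_question_prompt(description, c) for c in chunks]
print(prompts[0])
```

Each prompt would be sent to an LLM, and the returned question is what the MLLM experts later answer while looking at the image.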

Key Findings

Machine-generated LLaVA data cause frequent hallucinations in tuned MLLMs.

Numbers: 32.6% sentence-level CHAIR_obj when fine-tuned on LLaVA (Table 2).

Practical Use: Don't assume machine-generated instruction data is factual. If you fine-tune on it, expect ~30%+ object hallucination rates on evaluated images; clean the data first.

Evidence Ref: Table 2; Sec. 3 and Sec. 5.2

HalluciDoctor’s data cleaning (LLaVA+) substantially reduces hallucinations.

Numbers: Sentence-level CHAIR_obj 32.6% → 22.2% (model-agnostic, Table 2).

Practical Use: Filtering and rewriting hallucinated chunks before training can cut object hallucinations by ~30–40% (relative) on evaluated benchmarks.

Evidence Ref: Table 2; Sec. 5.2
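For readers unfamiliar with CHAIR, the sentence-level score is the share of captions that mention at least one item absent from the image, and the instance-level score is the share of mentioned items that are absent. The sketch below computes both on made-up toy data. It assumes mentions and ground truth are already normalized sets (real CHAIR evaluation also maps synonyms to dataset categories, which is skipped here), and `chair_scores` is a hypothetical helper name.

```python
def chair_scores(samples):
    """Compute (instance_level, sentence_level) hallucination rates.

    samples: list of (mentioned, ground_truth) pairs, each a set of strings.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_samples = 0
    for mentioned, ground_truth in samples:
        missing = mentioned - ground_truth  # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(missing)
        hallucinated_samples += bool(missing)
    instance_level = hallucinated_mentions / total_mentions if total_mentions else 0.0
    sentence_level = hallucinated_samples / len(samples) if samples else 0.0
    return instance_level, sentence_level


# Toy data (assumption): (objects mentioned in caption, objects annotated in image).
toy = [
    ({"dog", "frisbee"}, {"dog", "frisbee", "grass"}),
    ({"cat", "sofa", "lamp"}, {"cat", "sofa"}),
    ({"car"}, {"car", "road"}),
    ({"person", "umbrella"}, {"person"}),
]

chair_i, chair_s = chair_scores(toy)
print(f"instance-level: {chair_i:.2f}, sentence-level: {chair_s:.2f}")
```

The same counting applies to relation and attribute chunks if ground-truth sets are available for them.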

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Sentence-level CHAIR_obj (model-agnostic) | LLaVA 32.6% → LLaVA+ 22.2% → LLaVA++ 19.3% | LLaVA 32.6% | −13.3 pp (LLaVA → LLaVA++) | CHAIR evaluation (500 images from MSCOCO∩VG) | Table 2 reports sentence-level CHAIR results for model-agnostic setups. | Table 2 |
| Sentence-level CHAIR_obj (MiniGPT-4) | LLaVA 35.0% → LLaVA+ 19.6% → LLaVA++ 16.6% | LLaVA 35.0% | −18.4 pp (LLaVA → LLaVA++) | CHAIR evaluation (500 images) | Table 2 (specific MiniGPT-4 rows). | Table 2 |
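The deltas above are in percentage points; the relative reductions are larger. A short sketch that derives both from the reported values (the numbers come from the rows above, while `deltas` is a helper name introduced here for illustration):

```python
def deltas(baseline, value):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = round(value - baseline, 1)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return absolute_pp, relative_pct


# Sentence-level CHAIR_obj values as reported in the results rows.
reported = {
    "model-agnostic: LLaVA -> LLaVA+": (32.6, 22.2),
    "model-agnostic: LLaVA -> LLaVA++": (32.6, 19.3),
    "MiniGPT-4: LLaVA -> LLaVA+": (35.0, 19.6),
    "MiniGPT-4: LLaVA -> LLaVA++": (35.0, 16.6),
}

results = {name: deltas(base, val) for name, (base, val) in reported.items()}
for name, (pp, rel) in results.items():
    print(f"{name}: {pp} pp, {rel}% relative")
```

Reporting both forms avoids confusion when a "30–40%" reduction is quoted next to a "−13.3 pp" delta.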

What To Try In 7 Days

Run CHAIR on your visual instruction set to quantify object/relation/attribute hallucinations.

Extract scene-graph chunks and generate answer-based questions using an LLM (ChatGPT).

Query 2–3 off-the-shelf MLLM experts, compute consistency scores with a BERT-based metric, and flag low-consistency chunks (score < 0.5). Replace or remove flagged phrases via an LLM rewrite.
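The scoring and flagging step can be prototyped before wiring in real models. In the sketch below, the expert answers are hard-coded toy strings, and `token_f1` is a simple token-overlap stand-in used only so the example runs without model downloads; it is not the BERT-based metric the summary describes. Averaging over experts is also an assumption of this sketch. Only the 0.5 threshold comes from the text above.

```python
def token_f1(a: str, b: str) -> float:
    """Stand-in similarity (assumption): token-overlap F1, NOT a BERT-based score."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    if common == 0:
        return 0.0
    precision, recall = common / len(tb), common / len(ta)
    return 2 * precision * recall / (precision + recall)


def consistency_score(chunk_text, expert_answers, similarity=token_f1):
    """Average similarity between a chunk and each expert's answer."""
    if not expert_answers:
        return 0.0
    return sum(similarity(chunk_text, ans) for ans in expert_answers) / len(expert_answers)


THRESHOLD = 0.5  # threshold reported in the summary

# Toy expert answers (assumption): one answer per MLLM expert for each chunk.
answers_by_chunk = {
    "frisbee": ["frisbee", "a frisbee", "frisbee"],
    "umbrella": ["nothing", "a dog", "umbrella"],
}

scores = {c: consistency_score(c, answers) for c, answers in answers_by_chunk.items()}
flagged = [c for c, s in scores.items() if s < THRESHOLD]
print(scores, flagged)
```

Passing the similarity function as a parameter keeps the flagging logic unchanged when the stand-in is swapped for an embedding-based metric.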

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Dependence on the quality and diversity of the MLLM experts used for cross-checking; poor experts can miss or mislabel hallucinations.

The consistency threshold is sensitive: setting it too high removes correct information. The authors pick 0.5 after ablation.
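One way to check threshold sensitivity on your own data is to label a small validation set by hand and sweep the threshold. The sketch below uses made-up scores and labels, not the paper's ablation data, and `sweep` is a hypothetical helper. It counts correct chunks that would be removed and hallucinated chunks that would be kept at each threshold.

```python
def sweep(scored, thresholds):
    """For each threshold, count correct chunks removed and hallucinated chunks kept.

    scored: list of (consistency_score, is_correct) pairs from a hand-labeled set.
    A chunk is removed when its score is below the threshold.
    """
    rows = []
    for t in thresholds:
        correct_removed = sum(1 for s, ok in scored if ok and s < t)
        hallucinated_kept = sum(1 for s, ok in scored if not ok and s >= t)
        rows.append((t, correct_removed, hallucinated_kept))
    return rows


# Toy validation set (assumption): (score, human label "chunk is correct").
scored = [
    (0.92, True), (0.81, True), (0.64, True), (0.55, True),
    (0.58, False), (0.41, False), (0.30, False), (0.12, False),
]

table = sweep(scored, [0.3, 0.5, 0.7])
for t, removed, kept in table:
    print(f"threshold={t}: correct removed={removed}, hallucinated kept={kept}")
```

On this toy set a higher threshold starts removing correct chunks while a lower one keeps more hallucinated ones, which is the trade-off the limitation describes.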

When Not To Use

If you have fully human-verified visual instruction data already.

If you cannot afford the compute to query several MLLMs and run LLM prompts at dataset scale.

Failure Modes

Removing accurate phrases when consistency signals are weak, reducing useful diversity.

Bias propagation from MLLM experts: consistent but wrong answers vote to keep hallucinations.

Core Entities

Models

MiniGPT-4, LLaVA, mPLUG-Owl, BLIP2, InstructBLIP

Metrics

CHAIR_obj / CHAIR_rel / CHAIR_attri, ConScore (consistency score, threshold 0.5), MME total score, Accuracy

Datasets

LLaVA-Instruction-158K, MiniGPT4-Instruction, LLaVA+ (rectified), LLaVA++ (expanded/rectified)

Benchmarks

MME, CHAIR (extended), POPE, OwlEval, MSCOCO, Visual Genome, NoCaps, GQA, AOK-VQA