Automatically find and remove hallucinations in machine-generated visual instructions to make multimodal LLMs more accurate.

November 22, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, Yueting Zhuang

Links

Abstract / PDF

Why It Matters For Business

Cleaning synthetic visual instruction data cuts hallucinations and raises real-world reliability of multimodal models, reducing downstream errors and the need for runtime correction.

Summary TLDR

The authors identify that large amounts of machine-generated visual instruction data contain object, relation, and attribute hallucinations, which teach multimodal LLMs (MLLMs) to make incorrect claims about images. They introduce HalluciDoctor, a data-level pipeline that (1) parses description text into 'answer chunks' (objects, relations, attributes), (2) uses an LLM to generate a targeted question for each chunk, (3) queries multiple MLLM experts for image-oriented answers and scores their consistency with the chunk using a BERT-based metric, and (4) removes low-consistency chunks through LLM-based rewriting. They also add a seesaw-based counterfactual expansion to rebalance rare object co-occurrences. On several benchmarks the cleaned datasets (LLaVA+ and LLaVA++) lower hallucination rates and raise task scores compared with the original LLaVA data.
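The four steps can be read as a simple data flow. The sketch below is a minimal illustration, not the authors' code: the function name cross_check, the ChunkVerdict record, the averaging over experts, and all the toy stand-ins are assumptions made here so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ChunkVerdict:
    chunk: str          # answer chunk taken from the description
    question: str       # question generated for that chunk
    answers: List[str]  # one answer per expert
    score: float        # mean consistency between chunk and answers
    keep: bool          # False means the chunk should be rewritten out


def cross_check(
    description: str,
    extract_chunks: Callable[[str], List[str]],
    make_question: Callable[[str, str], str],
    experts: Sequence[Callable[[str], str]],
    consistency: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[ChunkVerdict]:
    """Run steps (1)-(3) and mark which chunks step (4) would remove."""
    if not experts:
        raise ValueError("at least one expert is required")
    verdicts = []
    for chunk in extract_chunks(description):
        question = make_question(description, chunk)
        answers = [expert(question) for expert in experts]
        # Assumption: average the per-expert consistency scores.
        score = sum(consistency(chunk, a) for a in answers) / len(answers)
        verdicts.append(ChunkVerdict(chunk, question, answers, score, score >= threshold))
    return verdicts


# Toy stand-ins (hypothetical) in place of the LLM, the MLLM experts, and the metric.
QUESTIONS = {
    "dog": "What animal is in the image?",
    "frisbee": "What is the animal catching?",
}
EXPERT_ANSWERS = {
    "What animal is in the image?": "dog",
    "What is the animal catching?": "ball",
}


def toy_extract(description: str) -> List[str]:
    return [c for c in QUESTIONS if c in description.lower()]


def toy_question(description: str, chunk: str) -> str:
    return QUESTIONS[chunk]


def toy_expert(question: str) -> str:
    return EXPERT_ANSWERS[question]


def exact_match(chunk: str, answer: str) -> float:
    return 1.0 if chunk == answer else 0.0


verdicts = cross_check(
    "A dog catches a frisbee.", toy_extract, toy_question, [toy_expert, toy_expert], exact_match
)
for v in verdicts:
    print(v.chunk, v.score, "keep" if v.keep else "rewrite")
```

In a real setup each callable would wrap a model call with the image bound in; passing them as parameters keeps the flow testable without any model.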

Problem Statement

Machine-generated visual instruction datasets (used to teach multimodal LLMs) contain many factual errors: captions claim objects, relations, or attributes that do not appear in the images. Training on this noisy data increases hallucinations in MLLMs. The problem is how to detect and remove diverse hallucinations automatically and at scale, without heavy manual labeling, and how to reduce the spurious co-occurrence biases that cause them.
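A CHAIR-style rate is one way to put a number on this kind of error. The sketch below follows the standard CHAIR definitions (hallucinated mentions over all mentions, and descriptions with at least one hallucinated mention over all descriptions). The function name and input layout are assumptions, and how the paper matches relation and attribute items against ground truth is not covered here.

```python
from typing import Iterable, List, Set, Tuple


def chair_scores(samples: Iterable[Tuple[List[str], Set[str]]]) -> Tuple[float, float]:
    """Return (instance-level, sentence-level) CHAIR over the samples.

    Each sample pairs the items mentioned in one description with the set
    of items actually present in the image (the ground-truth annotation).
    """
    total_mentions = 0
    hallucinated_mentions = 0
    total_descriptions = 0
    hallucinated_descriptions = 0
    for mentioned, present in samples:
        wrong = [m for m in mentioned if m not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        total_descriptions += 1
        hallucinated_descriptions += 1 if wrong else 0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_descriptions / total_descriptions if total_descriptions else 0.0
    return chair_i, chair_s


# Hypothetical toy data: three descriptions with their ground-truth objects.
samples = [
    (["dog", "frisbee"], {"dog", "grass"}),   # "frisbee" is hallucinated
    (["cat"], {"cat"}),                       # clean
    (["car", "tree", "bench"], {"car"}),      # "tree" and "bench" are hallucinated
]
chair_i, chair_s = chair_scores(samples)
print(f"instance-level: {chair_i:.3f}, sentence-level: {chair_s:.3f}")
```

Here 3 of 6 mentions are wrong (instance-level 0.5) and 2 of 3 descriptions contain an error (sentence-level about 0.667). The same counting could be applied to relation or attribute items if they are represented as hashable tuples.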

Main Contribution

A taxonomy and extended CHAIR metric that measures object, relation, and attribute hallucinations in visual instruction data.

HalluciDoctor: an automated cross-checking pipeline that extracts textual scene-graph chunks, generates answer-based questions, queries multiple MLLMs, and flags low-consistency chunks for removal.

A seesaw-based counterfactual instruction expansion that rebalances long-tail object co-occurrences to reduce spurious correlations and strengthen robustness.

Public release of rectified datasets (LLaVA+ and LLaVA++) and code for reproducible cleaning and expansion.
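The counterfactual expansion targets long-tail object co-occurrences. The sketch below only shows the statistics such a step could start from, namely counting how often object pairs appear together and listing the rare ones. It is not the authors' seesaw scoring, and the function names and the max_count cutoff are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def cooccurrence_counts(object_sets: Iterable[Set[str]]) -> Dict[Pair, int]:
    """Count how many samples contain each unordered object pair."""
    counts: Counter = Counter()
    for objects in object_sets:
        # Sorting makes ("fork", "knife") and ("knife", "fork") the same key.
        for pair in combinations(sorted(set(objects)), 2):
            counts[pair] += 1
    return dict(counts)


def rare_pairs(counts: Dict[Pair, int], max_count: int = 1) -> List[Pair]:
    """Pairs seen at most max_count times, i.e. candidates for rebalancing."""
    return sorted(pair for pair, n in counts.items() if n <= max_count)


# Hypothetical toy data: objects annotated for three images.
object_sets = [
    {"fork", "knife", "plate"},
    {"fork", "knife"},
    {"fork", "dog"},
]
counts = cooccurrence_counts(object_sets)
print(counts[("fork", "knife")])   # frequent pair
print(rare_pairs(counts))          # long-tail pairs
```

A dominant pair such as fork and knife is the kind of correlation that can lead a model to mention one object whenever it sees the other, which is why the rare pairs are the ones worth adding examples for.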

Key Findings

Machine-generated LLaVA data cause frequent hallucinations in tuned MLLMs.

Numbers: 32.6% sentence-level CHAIR_obj when fine-tuned on LLaVA (Table 2).

HalluciDoctor’s data cleaning (LLaVA+) substantially reduces hallucinations.

Numbers: Sentence-level CHAIR_obj 32.6% → 22.2% (model-agnostic, Table 2).

Cleaning plus counterfactual expansion (LLaVA++) improves both robustness and task scores.

Numbers: MiniGPT-4 MME total 1148.93 → 1207.18 → ~1276.0 (LLaVA → LLaVA+ → LLaVA++, Table 3).

Results

Sentence-level CHAIR_obj (model-agnostic)

Value: LLaVA 32.6% → LLaVA+ 22.2% → LLaVA++ 19.3%

Baseline: LLaVA 32.6%

Sentence-level CHAIR_obj (MiniGPT-4)

Value: LLaVA 35.0% → LLaVA+ 19.6% → LLaVA++ 16.6%

Baseline: LLaVA 35.0%

MME total score (MiniGPT-4)

Value: 1148.93 → 1207.18 → ~1276.0 (LLaVA → LLaVA+ → LLaVA++)

Baseline: LLaVA 1148.93

Accuracy

Value: w/ LLaVA 75.1 / 77.8 → w/ LLaVA+ 79.1 / 80.0 → w/ LLaVA++ 80.1 / 80.4

Baseline: w/ LLaVA 75.1 / 77.8
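The relative size of these changes can be checked with a few lines of arithmetic. The snippet uses only the figures reported above; the helper name relative_change is an assumption, and the LLaVA++ MME value is approximate in the source.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percent change of value relative to baseline (negative means a drop)."""
    return (value - baseline) / baseline * 100.0


# (LLaVA, LLaVA+, LLaVA++) as reported in the results above.
results = {
    "CHAIR_obj sentence-level, model-agnostic (%)": (32.6, 22.2, 19.3),
    "CHAIR_obj sentence-level, MiniGPT-4 (%)": (35.0, 19.6, 16.6),
    "MME total, MiniGPT-4": (1148.93, 1207.18, 1276.0),  # last value is approximate
}

for name, (llava, plus, plus_plus) in results.items():
    print(
        f"{name}: LLaVA+ {relative_change(llava, plus):+.1f}%, "
        f"LLaVA++ {relative_change(llava, plus_plus):+.1f}%"
    )
```

For example, the model-agnostic CHAIR_obj drop from 32.6% to 22.2% is a relative reduction of roughly 32%, while the MME gain from 1148.93 to 1207.18 is about 5%.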

Who Should Care

What To Try In 7 Days

Run CHAIR on your visual instruction set to quantify object/relation/attribute hallucinations.

Extract scene-graph chunks and generate answer-based questions using an LLM (ChatGPT).

Query 2–3 off-the-shelf MLLM experts, compute consistency scores with a BERT-based metric, and flag chunks that score below 0.5. Replace or remove the flagged phrases with an LLM rewrite.
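The scoring and flagging step can be prototyped before any model is wired in. In the sketch below, token-overlap F1 stands in for the BERT-based metric so the example has no dependencies; it is a swapped-in similarity, not the metric the authors use. The function names and the averaging over experts are also assumptions.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def token_f1(chunk: str, answer: str) -> float:
    """Token-overlap F1 between two strings (stand-in for a BERT-based score)."""
    chunk_tokens = re.findall(r"[a-z0-9]+", chunk.lower())
    answer_tokens = re.findall(r"[a-z0-9]+", answer.lower())
    if not chunk_tokens or not answer_tokens:
        return 0.0
    overlap = sum((Counter(chunk_tokens) & Counter(answer_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(chunk_tokens)
    return 2 * precision * recall / (precision + recall)


def con_score(
    chunk: str,
    expert_answers: List[str],
    similarity: Callable[[str, str], float] = token_f1,
) -> float:
    """Mean similarity between a chunk and each expert's answer."""
    if not expert_answers:
        raise ValueError("at least one expert answer is required")
    return sum(similarity(chunk, a) for a in expert_answers) / len(expert_answers)


def flag_chunks(chunk_answers: Dict[str, List[str]], threshold: float = 0.5) -> List[str]:
    """Chunks whose consistency falls below the threshold."""
    return [c for c, answers in chunk_answers.items() if con_score(c, answers) < threshold]


# Hypothetical answers from three experts for two chunks of one description.
chunk_answers = {
    "red frisbee": ["a red frisbee", "red frisbee", "frisbee"],
    "surfboard": ["a dog", "grass", "nothing"],
}
flagged = flag_chunks(chunk_answers)
print(flagged)
```

Swapping in a real BERT-based metric only requires passing a different similarity callable to con_score; the thresholding logic stays the same.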

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dependence on the quality and diversity of the MLLM experts used for cross-checking; poor experts can miss or mislabel hallucinations.
  • The consistency threshold is sensitive: set too high, it removes correct information; set too low, it lets hallucinations through. The authors pick 0.5 after ablation.
  • Counterfactual expansion relies on image synthesis/placement steps that can introduce visual artifacts or unrealistic training signals.
  • Relation and attribute pseudo-labels use foundation models (GroundingDINO, BLIP) and may be noisy.

When Not To Use

  • If you have fully human-verified visual instruction data already.
  • If you cannot afford the compute to query several MLLMs and run LLM prompts at dataset scale.
  • If you are restricted from modifying training data and can only change inference-time policies.

Failure Modes

  • Removing accurate phrases when consistency signals are weak, reducing useful diversity.
  • Bias propagation from MLLM experts: consistent but wrong answers vote to keep hallucinations.
  • Counterfactual examples may create distributional artifacts that hurt generalization if synthesis is low-quality.
  • Threshold miscalibration may either miss hallucinations or over-delete content.
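One way to check for threshold miscalibration is to sweep candidate thresholds over a small hand-labeled sample and look at the precision and recall of the flagging. The sketch below assumes such a sample exists; the data, function name, and candidate thresholds are hypothetical and not from the paper.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    scored: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for flagging chunks with score < threshold.

    scored pairs each chunk's consistency score with a human label that is
    True when the chunk really is a hallucination.
    """
    total_hallucinations = sum(1 for _, is_hallu in scored if is_hallu)
    rows = []
    for t in thresholds:
        flagged = [is_hallu for score, is_hallu in scored if score < t]
        true_positives = sum(1 for is_hallu in flagged if is_hallu)
        # Convention: with nothing flagged, precision is reported as 1.0.
        precision = true_positives / len(flagged) if flagged else 1.0
        recall = true_positives / total_hallucinations if total_hallucinations else 1.0
        rows.append((t, precision, recall))
    return rows


# Hypothetical hand-labeled sample: (consistency score, is it a hallucination?)
labeled = [
    (0.10, True),
    (0.30, True),
    (0.45, False),
    (0.60, True),
    (0.80, False),
    (0.90, False),
]
rows = sweep_thresholds(labeled, [0.3, 0.5, 0.7])
for t, precision, recall in rows:
    print(f"threshold {t:.1f}: precision {precision:.2f}, recall {recall:.2f}")
```

Low precision means accurate phrases are being deleted; low recall means hallucinations are slipping through. Both failure modes above show up directly in these two numbers.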

Core Entities

Models

  • MiniGPT-4
  • LLaVA
  • mPLUG-Owl
  • BLIP2
  • InstructBLIP

Metrics

  • CHAIR_obj/CHAIR_rel/CHAIR_attri
  • ConScore (consistency score, threshold 0.5)
  • MME total score
  • Accuracy

Datasets

  • LLaVA-Instruction-158K
  • MiniGPT4-Instruction
  • LLaVA+ (rectified)
  • LLaVA++ (expanded/rectified)

Benchmarks

  • MME
  • CHAIR (extended)
  • POPE
  • OwlEval
  • MSCOCO
  • Visual Genome
  • NoCaps
  • GQA
  • A-OKVQA