Automate rich, low-hallucination image captions by combining vision experts with multi-modal and text LLMs

June 11, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang

Links

Abstract / PDF

Why It Matters For Business

Image Textualization (IT) automates the production of high-quality, detailed captions. These captions improve downstream image generation and reduce hallucination in vision-language systems, which lowers labeling costs and makes models more useful for search, generation, and retrieval.

Summary TLDR

The paper introduces Image Textualization (IT): a three-stage pipeline that (1) uses a multimodal LLM to produce a template caption, (2) extracts many fine-grained visual facts with vision expert models (dense captioner, detector, SAM, depth estimator) and detects hallucinations, and (3) asks a text-only LLM to re-write a detailed, low-hallucination caption from the textualized facts. The authors release an IT-170K dataset and three new benchmarks (DID-, D2I-, LIN-Bench). On those benchmarks, IT captions are longer, contain more object detail, reduce hallucination, and improve downstream MLLM fine-tuning (e.g., LLaVA-7B fine-tuned on IT data shows large gains on BLEU and semantic metrics).
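The three stages can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: `VisualFact`, `textualize`, `image_textualization`, the prompt wording, and the "smaller depth means closer" convention are all assumptions, and the four callables stand in for whatever MLLM, vision experts, and text LLM are actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VisualFact:
    """One fine-grained fact from a vision expert (hypothetical schema)."""
    label: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    depth: float                    # relative depth; smaller = closer (assumed)


def textualize(facts: List[VisualFact]) -> str:
    """Render structured facts as plain text, nearest objects first."""
    ordered = sorted(facts, key=lambda f: f.depth)
    return "\n".join(
        f"- {f.label} at box {f.box}, relative depth {f.depth:.2f}"
        for f in ordered
    )


def image_textualization(
    image: object,
    mllm_caption: Callable[[object], str],
    extract_facts: Callable[[object], List[VisualFact]],
    find_hallucinations: Callable[[str, List[VisualFact]], List[str]],
    llm_rewrite: Callable[[str], str],
) -> str:
    # Stage 1: a multimodal LLM drafts the template caption.
    template = mllm_caption(image)
    # Stage 2: vision experts supply facts; unsupported objects are flagged.
    facts = extract_facts(image)
    suspects = find_hallucinations(template, facts)
    # Stage 3: a text-only LLM rewrites the caption from the textualized facts.
    prompt = (
        "Rewrite the caption using only the listed facts.\n"
        f"Template caption: {template}\n"
        f"Remove unsupported objects: {', '.join(suspects) or 'none'}\n"
        f"Visual facts:\n{textualize(facts)}"
    )
    return llm_rewrite(prompt)
```

Passing the models in as callables keeps the sketch independent of any particular model API, so each stage can be swapped or stubbed out for testing.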

Problem Statement

Existing image-caption data is either scraped and noisy, or human-labeled but short and costly to produce. Pure MLLMs hallucinate and miss fine detail. Vision models see local detail but lack holistic language fluency. We need an automated, scalable way to produce detailed, accurate image descriptions for training and downstream tasks.

Main Contribution

Image Textualization (IT): a 3-phase automatic pipeline that combines MLLMs, vision expert models, and LLM recaptioning for detailed captions.

Three evaluation benchmarks for long/detailed captions: DID-Bench (human-checked references), D2I-Bench (description→image similarity), and LIN-Bench (linguistic richness).

IT-170K: a released dataset of IT-generated captions and open-source code on GitHub and HuggingFace.

Key Findings

IT captions are substantially more informative and closer to human references than raw MLLM captions on automatic caption metrics.

Numbers: BLEU-1 11.35 → 23.78 (IT-LLaVA) and 11.35 → 46.79 (IT-GPT4-V) on combined GT
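For readers unfamiliar with the metric, BLEU-1 is clipped unigram precision multiplied by a brevity penalty. The sketch below uses whitespace tokenization and lowercasing, which are assumptions; the tokenizer and BLEU implementation used in the paper are not stated here. It returns a value in [0, 1], so multiply by 100 to compare with the figures above.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: str, references: List[str]) -> float:
    """Clipped unigram precision times brevity penalty, in [0, 1]."""
    cand = candidate.lower().split()
    if not cand or not references:
        return 0.0

    # For each token, the most times it appears in any single reference.
    max_ref: Counter = Counter()
    for ref in references:
        for tok, n in Counter(ref.lower().split()).items():
            max_ref[tok] = max(max_ref[tok], n)

    # Clip candidate counts so repeated words cannot inflate precision.
    clipped = sum(min(n, max_ref[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty uses the reference length closest to the candidate.
    ref_lens = [len(r.split()) for r in references]
    closest = min(ref_lens, key=lambda r: (abs(r - len(cand)), r))
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return bp * precision
```

Because of the brevity penalty, a candidate much shorter than its reference scores low even when every word matches, which matters when references are long, detailed descriptions.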

Descriptions from IT produce text-to-image outputs that match originals better by embedding similarity.

Numbers: D2I CLIP-score COCO 72.24 → IT-LLaVA 74.27 → IT-GPT4-V 77.10 on evaluated set
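The comparison behind this number can be sketched as a cosine similarity between the embedding of the original image and the embedding of the image generated from its description. The embeddings here are plain lists of floats supplied by the caller; the function names and the scaling by 100 (inferred from values such as 72.24) are assumptions rather than details confirmed by the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def d2i_score(
    original_embs: Sequence[Sequence[float]],
    generated_embs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity over (original, generated) pairs, scaled to 0-100."""
    if len(original_embs) != len(generated_embs) or not original_embs:
        raise ValueError("need equally many, non-empty embedding lists")
    sims = [cosine_similarity(o, g) for o, g in zip(original_embs, generated_embs)]
    return 100.0 * sum(sims) / len(sims)
```

In practice the vectors would come from an image encoder such as CLIP, applied to both the source image and the text-to-image output.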

IT captions are longer and use more content words (nouns, verbs, adjectives).

Numbers: Average words per caption LLaVA 92.6 → IT-LLaVA 131.6; GPT-4V 159.9 → IT-GPT4-V 193.3
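Length and readability statistics of this kind are cheap to compute. The sketch below shows mean words per caption and the Automated Readability Index (ARI), using the standard ARI formula 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The regex tokenizer and sentence splitter are simplifying assumptions and may not match the tooling used for LIN-Bench.

```python
import re
from typing import List

_WORD = re.compile(r"[A-Za-z0-9]+")  # assumed tokenizer: alphanumeric runs


def mean_words(captions: List[str]) -> float:
    """Average number of words per caption."""
    return sum(len(_WORD.findall(c)) for c in captions) / len(captions)


def ari(text: str) -> float:
    """Automated Readability Index of a passage."""
    words = _WORD.findall(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def relative_gain(before: float, after: float) -> float:
    """Fractional increase from `before` to `after`."""
    return (after - before) / before
```

Applied to the reported averages, `relative_gain(92.57, 131.61)` is about 0.42 and `relative_gain(159.93, 193.33)` is about 0.21, so IT lengthens LLaVA captions by roughly 42% and GPT-4V captions by roughly 21%.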

Fine-tuning an MLLM with IT data reduces hallucination and increases descriptive richness.

Numbers: POPE (hallucination benchmark) and LIN-Bench show improved average scores for MLLMs tuned on IT-generated data (Table 2)
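POPE probes object hallucination with yes/no questions about whether an object is present in an image, so scoring reduces to binary classification metrics. The following is a minimal scoring sketch; the answer parsing (treating any reply that starts with "yes" as positive) and the function name are assumptions, and this summary does not say which of these metrics Table 2 reports.

```python
from typing import Dict, List


def pope_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Score yes/no answers against ground truth, with 'yes' as positive."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally many, non-empty predictions and labels")

    pred_yes = [p.strip().lower().startswith("yes") for p in predictions]
    gold_yes = [g.strip().lower() == "yes" for g in labels]

    tp = sum(p and g for p, g in zip(pred_yes, gold_yes))
    fp = sum(p and not g for p, g in zip(pred_yes, gold_yes))
    fn = sum(not p and g for p, g in zip(pred_yes, gold_yes))
    tn = sum(not p and not g for p, g in zip(pred_yes, gold_yes))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A high yes-ratio suggests the model affirms objects indiscriminately.
        "yes_ratio": sum(pred_yes) / len(labels),
    }
```

A false positive here corresponds to the model claiming an object that is absent, which is the hallucination behavior the benchmark targets.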

Results

BLEU-1 (DID-Bench, combined GT)

Value: Baseline 11.35 → IT-LLaVA 23.78 → IT-GPT4-V 46.79

Baseline: 11.35

D2I similarity (CLIP-score)

Value: COCO 72.24 → IT-LLaVA 74.27 → IT-GPT4-V 77.10

Baseline: 72.24 (COCO)

Average words per description (LIN-Bench stat)

Value: LLaVA 92.57 → IT-LLaVA 131.61; GPT-4V 159.93 → IT-GPT4-V 193.33

Baseline: 92.57 (LLaVA), 159.93 (GPT-4V)

POPE (hallucination benchmark)

Value: Tuning with IT data reduces hallucination relative to baselines on POPE (Table 2)

Who Should Care

What To Try In 7 Days

Run IT pipeline on a small image subset to compare IT captions vs existing captions.

Fine-tune an MLLM (e.g., LLaVA-7B) on IT captions and evaluate hallucination on POPE.

Use IT captions to condition a text-to-image model and check D2I similarity to originals.

Reproducibility

License

  • Code MIT; data Apache-2.0

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Did not tune or test very large MLLMs (e.g., LLaVA-70B) due to compute limits, so gains on larger models are untested.
  • Benchmarks and automatic metrics can be sensitive to caption style; the paper notes evaluation depends on MLLM style and uses multiple ground-truth splits.
  • IT relies on the coverage and accuracy of the vision expert models; missing or wrong detections will limit caption correctness.

When Not To Use

  • When you need strictly human-authored stylistic captions (e.g., marketing copy) rather than detailed factual descriptions.
  • If you lack access to reliable vision expert models (dense captioners, detectors, SAM, depth estimator).
  • When compute or latency constraints prevent running multiple vision models and LLM recaptioning.

Failure Modes

  • Missed or incorrect object detections lead to missing details or false removals during hallucination detection.
  • Depth or size estimation errors can flip spatial relations, causing wrong 'foreground/background' claims.
  • Style bias from the initial MLLM reference description can affect final wording and automatic-metric scores.
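The first two failure modes can be made concrete with a small sketch. It assumes hallucination detection works by checking each object mentioned in the caption against detector confidences, and that spatial order is read from mean relative depth with smaller values meaning closer. The function names, the 0.35 threshold, and the 0.05 margin are all hypothetical choices for illustration, not values from the paper.

```python
from typing import Dict, List


def flag_hallucinations(
    mentioned: List[str],
    detections: Dict[str, float],
    threshold: float = 0.35,  # hypothetical confidence cutoff
) -> List[str]:
    """Return mentioned objects the detector did not confirm.

    An object the detector simply missed is flagged too, which is how a
    missed detection turns into a false removal.
    """
    return [obj for obj in mentioned if detections.get(obj, 0.0) < threshold]


def spatial_relation(depth_a: float, depth_b: float, margin: float = 0.05) -> str:
    """Describe A relative to B; smaller depth means closer (assumed).

    Differences inside the margin are reported as uncertain rather than
    forced into a foreground/background claim that a small estimation
    error could flip.
    """
    if abs(depth_a - depth_b) < margin:
        return "uncertain"
    return "in front of" if depth_a < depth_b else "behind"
```

For example, with `mentioned = ["dog", "frisbee", "bench"]` and `detections = {"dog": 0.9, "bench": 0.2}`, both "frisbee" and "bench" are flagged, even if the bench is really present and was only detected with low confidence.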

Core Entities

Models

  • LLaVA-7B
  • GPT-4V
  • GroundingDINO
  • SAM (Segment Anything)
  • dense captioner (DC)
  • monocular depth estimator
  • CLIP
  • PixArt

Metrics

  • BLEU
  • ROUGE-L
  • METEOR
  • SPICE
  • WMD
  • CLIP-score
  • DINO-score
  • ARI
  • Flesch-Kincaid (FK)
  • SMOG

Datasets

  • COCO
  • CC3M
  • CC12M
  • LAION
  • IT-170K (this paper)

Benchmarks

  • DID-Bench
  • D2I-Bench
  • LIN-Bench
  • POPE