Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
Image Textualization (IT) automatically generates detailed, high-quality captions. These captions improve downstream image generation and reduce hallucination in vision-language systems, which lowers labeling costs and makes models more useful for search, generation, and retrieval.
Summary TLDR
The paper introduces Image Textualization (IT), a three-stage pipeline that (1) uses a multimodal LLM to produce a template caption, (2) extracts fine-grained visual facts with vision expert models (dense captioner, detector, SAM, depth estimator) and detects hallucinations in the template, and (3) asks a text-only LLM to rewrite the textualized facts into a detailed, low-hallucination caption. The authors release the IT-170K dataset and three new benchmarks (DID-Bench, D2I-Bench, and LIN-Bench). On those benchmarks, IT captions are longer, contain more object detail, and hallucinate less, and they improve downstream MLLM fine-tuning (e.g., LLaVA-7B fine-tuned on IT data shows large gains on BLEU and semantic metrics).
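The three stages described above can be sketched as a thin orchestration layer. This is a minimal illustration, not the authors' implementation: the `VisualFact` schema, the prompt wording, and all function names are hypothetical, and the four model calls are passed in as plain callables so the sketch runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VisualFact:
    """One fine-grained fact from a vision expert model (hypothetical schema)."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    depth: float                    # assumed convention: smaller means closer


def textualize(facts: List[VisualFact]) -> str:
    """Render structured facts as plain text that a text-only LLM can read."""
    return "\n".join(
        f"- {f.label}: box={f.box}, depth={f.depth:.2f}" for f in facts
    )


def build_recaption_prompt(
    template: str, facts: List[VisualFact], hallucinated: List[str]
) -> str:
    """Assemble the stage-3 prompt (the wording here is an assumption)."""
    removed = ", ".join(hallucinated) if hallucinated else "none"
    return (
        "Rewrite the caption into a detailed description.\n"
        f"Template caption: {template}\n"
        f"Unverified objects to drop: {removed}\n"
        "Verified visual facts:\n"
        f"{textualize(facts)}"
    )


def image_textualization(
    image: object,
    caption_with_mllm: Callable[[object], str],
    extract_facts: Callable[[object], List[VisualFact]],
    find_hallucinations: Callable[[str, List[VisualFact]], List[str]],
    rewrite_with_llm: Callable[[str], str],
) -> str:
    template = caption_with_mllm(image)                   # stage 1: template caption
    facts = extract_facts(image)                          # stage 2: visual facts
    hallucinated = find_hallucinations(template, facts)   # stage 2: hallucination check
    prompt = build_recaption_prompt(template, facts, hallucinated)
    return rewrite_with_llm(prompt)                       # stage 3: text-only rewrite
```

Passing the models in as callables keeps each stage swappable, which matters here because caption quality depends on which vision expert models are available.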
Problem Statement
Existing image-caption data is either scraped and noisy or human-labeled, short, and costly. Pure MLLMs hallucinate and miss fine detail. Vision models see local detail but lack holistic language fluency. We need an automated, scalable way to produce detailed, accurate image descriptions for training and downstream tasks.
Main Contribution
Image Textualization (IT): a 3-phase automatic pipeline that combines MLLMs, vision expert models, and LLM recaptioning for detailed captions.
Three evaluation benchmarks for long/detailed captions: DID-Bench (human-checked references), D2I-Bench (description→image similarity), and LIN-Bench (linguistic richness).
IT-170K: a released dataset of IT-generated captions, with open-source code available on GitHub and Hugging Face.
Key Findings
IT captions are substantially more informative and closer to human references than raw MLLM captions on automatic caption metrics.
Text-to-image outputs generated from IT descriptions match the original images more closely, as measured by embedding similarity.
IT captions are longer and use more content words (nouns, verbs, adjectives).
Fine-tuning an MLLM with IT data reduces hallucination and increases descriptive richness.
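The embedding-similarity comparison in the second finding reduces to a cosine similarity between the embedding of the original image and the embedding of the image regenerated from the caption. The sketch below assumes the embeddings have already been computed by an image encoder (for example CLIP or DINO) and are available as plain lists of floats; the function names are illustrative and not taken from the paper's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_pair_similarity(
    originals: Sequence[Sequence[float]],
    regenerated: Sequence[Sequence[float]],
) -> float:
    """Average similarity over (original, regenerated) image pairs."""
    if not originals or len(originals) != len(regenerated):
        raise ValueError("need the same non-zero number of embeddings on both sides")
    scores = [cosine_similarity(o, r) for o, r in zip(originals, regenerated)]
    return sum(scores) / len(scores)
```

A higher mean indicates that the captions preserved more of the visual content needed to reconstruct the original image.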
Results
BLEU-1 (DID-Bench, combined GT)
D2I similarity (CLIP-score)
Average words per description (LIN-Bench stat)
POPE (hallucination benchmark)
Who Should Care
What To Try In 7 Days
Run the IT pipeline on a small image subset and compare its captions against your existing captions.
Fine-tune an MLLM (e.g., LLaVA-7B) on IT captions and evaluate hallucination on POPE.
Use IT captions to condition a text-to-image model and check D2I similarity to originals.
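For the hallucination check in the second step, POPE poses yes/no questions about object presence, so scoring comes down to binary classification metrics with "yes" as the positive class. The sketch below assumes the model answers have already been normalized to the strings "yes" or "no"; the function name and the returned dictionary layout are illustrative.

```python
from typing import Dict, Sequence


def pope_scores(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1, and yes-ratio for yes/no answers."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("need the same non-zero number of predictions and labels")
    pairs = [(p.strip().lower(), l.strip().lower()) for p, l in zip(predictions, labels)]
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above 0.5 suggests the model over-affirms object presence.
        "yes_ratio": (tp + fp) / len(pairs),
    }
```

Comparing these numbers before and after fine-tuning on IT captions gives a quick read on whether hallucination went down.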
Reproducibility
License
- Code MIT; data Apache-2.0
Code URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Did not tune or test very large MLLMs (e.g., LLaVA-70B) due to compute limits, so gains on larger models are untested.
- Benchmarks and automatic metrics can be sensitive to caption style; the paper notes evaluation depends on MLLM style and uses multiple ground-truth splits.
- IT relies on the coverage and accuracy of the vision expert models; missing or wrong detections will limit caption correctness.
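The last limitation can be illustrated with a deliberately simplified stand-in for the hallucination check. Here an object mentioned in the template caption is kept only if a detector confirms it. The function name and the exact-string matching are assumptions made for illustration; the paper's pipeline grounds objects with a detector rather than comparing label strings.

```python
from typing import Iterable, List


def flag_unverified(mentioned_objects: Iterable[str], detected_labels: Iterable[str]) -> List[str]:
    """Return caption objects that no detection confirms (case-insensitive match)."""
    detected = {label.strip().lower() for label in detected_labels}
    return [obj for obj in mentioned_objects if obj.strip().lower() not in detected]


# The caption mentions three objects that are all really in the image,
# but the hypothetical detector missed the frisbee.
caption_objects = ["dog", "frisbee", "bench"]
detections = ["Dog", "bench"]
unverified = flag_unverified(caption_objects, detections)
print(unverified)
```

In this toy case the frisbee is flagged even though it is present, so a rewrite step that drops flagged objects would remove a true detail. That is how detector coverage caps caption correctness.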
When Not To Use
- When you need strictly human-authored stylistic captions (e.g., marketing copy) rather than detailed factual descriptions.
- If you lack access to reliable vision expert models (dense captioners, detectors, SAM, depth estimator).
- When compute or latency constraints prevent running multiple vision models and LLM recaptioning.
Failure Modes
- Missed or incorrect object detections lead to missing details or false removals during hallucination detection.
- Depth or size estimation errors can flip spatial relations, causing wrong 'foreground/background' claims.
- Style bias from the initial MLLM reference description can affect final wording and automatic-metric scores.
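The depth-related failure mode is easy to reproduce with a toy rule. The sketch below assumes each object has a mean relative depth in [0, 1] where smaller means closer, and it adds an optional `margin` under which no relation is emitted. Both the convention and the margin are assumptions for illustration, not part of the paper's method.

```python
from typing import Optional


def spatial_relation(depth_a: float, depth_b: float, margin: float = 0.0) -> Optional[str]:
    """Describe object A relative to object B from mean depths (smaller = closer)."""
    if abs(depth_a - depth_b) <= margin:
        return None  # too close to call, so make no claim
    return "in front of" if depth_a < depth_b else "behind"


# A small estimation error flips the claim when the two objects are at similar depth.
print(spatial_relation(0.48, 0.50))               # in front of
print(spatial_relation(0.52, 0.50))               # behind
# With a margin, the same noisy estimates produce no claim instead of a wrong one.
print(spatial_relation(0.52, 0.50, margin=0.05))  # None
```

Abstaining on near-ties trades some spatial detail for fewer wrong foreground/background statements.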
Core Entities
Models
- LLaVA-7B
- GPT-4V
- GroundingDINO
- SAM (Segment Anything)
- dense captioner (DC)
- monocular depth estimator
- CLIP
- PixArt
Metrics
- BLEU
- ROUGE-L
- METEOR
- SPICE
- WMD
- CLIP-score
- DINO-score
- ARI
- Flesch-Kincaid (FK)
- SMOG
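Of the readability metrics listed above, ARI (Automated Readability Index) is the simplest to compute because it needs only character, word, and sentence counts: 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The sketch below uses naive regex tokenization, which is an assumption; a benchmark implementation may tokenize differently and give slightly different scores.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        raise ValueError("text contains no words")
    characters = sum(len(w) for w in words)           # letters and digits only
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * (characters / len(words)) + 0.5 * (len(words) / sentences) - 21.43


short = "The cat sat. The dog ran."
longer = "A brown dog leaps across the wet grass to catch a spinning red frisbee."
print(automated_readability_index(short) < automated_readability_index(longer))
```

Longer words and longer sentences both raise the score, which is why detailed captions tend to score higher than short ones.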
Datasets
- COCO
- CC3M
- CC12M
- LAION
- IT-170K (this paper)
Benchmarks
- DID-Bench
- D2I-Bench
- LIN-Bench
- POPE

