Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
26
Why It Matters For Business
Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.
Summary TLDR
The paper studies hallucinations (false statements not supported by the image) in large vision-language models (LVLMs). It shows that prior object-query tests are biased by prompts and poorly reflect real-world hallucination. The authors build HaELM, an LLM-based evaluator fine-tuned with simulated hallucination examples. HaELM reaches about 95% of ChatGPT's accuracy (61% vs. 64% on human labels), runs locally, costs less, and is reproducible. Using HaELM, they quantify how prompts, output length, top-K sampling, and temperature increase hallucination, and give practical tips: prefer concise captions, lower sampling/temperature, and evaluate models like LLaVA that trade less hallucinated
Problem Statement
Vision-language models sometimes state things not visible in the image (hallucinations). Existing object-query tests ("Is there a {object}?") are heavily biased by the prompt and overestimate hallucination. We need a real-world evaluation that understands full free-form descriptions and matches human judgment.
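The object-query protocol criticized above can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names `build_object_queries` and `yes_rate`, the sample objects, and the stubbed model answers are all hypothetical.

```python
def build_object_queries(objects, template="Is there a {object}?"):
    """Fill the query template once per candidate object."""
    return [template.format(object=name) for name in objects]


def yes_rate(answers):
    """Fraction of answers that begin with 'yes' (case-insensitive)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    hits = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return hits / len(answers)


# Hypothetical objects that are NOT present in the image.
absent_objects = ["dog", "bicycle", "umbrella", "laptop", "kite"]
queries = build_object_queries(absent_objects)

# Stubbed replies standing in for real LVLM output (assumption, not real data).
stub_answers = [
    "Yes, there is a dog.",
    "Yes.",
    "No.",
    "yes, a laptop is visible",
    "Yes",
]

print(queries[0])
print(f"yes-rate on absent objects: {yes_rate(stub_answers):.0%}")
```

A high yes-rate on objects known to be absent measures how readily the model agrees with the prompt, which is a different quantity from how often its free-form captions mention things that are not in the image.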
Main Contribution
- Showed that object-query evaluation is prompt-sensitive and poorly reflects hallucination in real captions.
- Proposed HaELM, an LLM-based hallucination evaluator trained on simulated and real LVLM responses; it performs close to ChatGPT while being cheap and running locally.
- Measured hallucination across open LVLMs and analyzed its drivers (prompts, length, sampling, temperature), with practical mitigation suggestions.
Key Findings
- Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, but hallucination in real captions is below 10%.
- HaELM reaches about 95% of ChatGPT's performance while being cheaper and running locally.
- Hallucination rates differ widely across LVLMs on MS-COCO prompts: LLaVA ~19.4%, MiniGPT-4 ~55.0%, mPLUG-Owl ~36.2%.
- Longer outputs, higher top-K sampling, and higher temperature all raise hallucination.
- HaELM is biased differently from ChatGPT: it leans toward "no hallucination", while ChatGPT tends to flag hallucinations.
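The "about 95%" figure follows directly from the two accuracies quoted in the summary (61% for HaELM and 64% for ChatGPT on human labels). A small sketch of that arithmetic, with a hypothetical helper name:

```python
def relative_performance(candidate_accuracy, reference_accuracy):
    """Candidate accuracy expressed as a fraction of the reference accuracy."""
    if reference_accuracy <= 0:
        raise ValueError("reference_accuracy must be positive")
    return candidate_accuracy / reference_accuracy


# Accuracies on human labels, as quoted in the summary.
haelm_accuracy = 0.61
chatgpt_accuracy = 0.64

relative = relative_performance(haelm_accuracy, chatgpt_accuracy)
print(f"{relative:.1%}")  # prints 95.3%
```

Note that this is a ratio of accuracies: the absolute gap is three percentage points, and both evaluators remain well below perfect agreement with human labels.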
Results
Accuracy
average F1 by LVLM (HaELM)
hallucination ratio by model
effect of max generation length
effect of top-K sampling
effect of temperature
time & monetary cost (one-time setup + per-eval)
Who Should Care
What To Try In 7 Days
- Run HaELM locally on a small set of your images to get baseline hallucination rates.
- Compare two LVLMs on your data; prefer the model with the lower hallucination rate for safety-sensitive features.
- Reduce the maximum output length and lower top-K and temperature to cut hallucinations in production captions.
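To see why lowering top-K and temperature narrows what a model can say, here is a minimal sketch of generic top-K sampling with temperature over a toy next-token distribution. It is not the decoding code of any specific LVLM or of the paper; the function name `top_k_temperature_probs` and the toy logits are hypothetical.

```python
import math


def top_k_temperature_probs(logits, k, temperature):
    """Keep the k highest-scoring tokens, divide their logits by the
    temperature, and renormalize with a softmax."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for (token, _), e in zip(top, exps)}


# Toy logits for the next word of a caption (illustrative values only).
toy_logits = {"dog": 2.0, "cat": 1.0, "frisbee": 0.5, "umbrella": 0.0}

loose = top_k_temperature_probs(toy_logits, k=4, temperature=1.5)
tight = top_k_temperature_probs(toy_logits, k=2, temperature=0.5)

print("loose:", {t: round(p, 3) for t, p in loose.items()})
print("tight:", {t: round(p, 3) for t, p in tight.items()})
```

With the tighter settings, low-scoring tokens are removed from the candidate set and the remaining probability mass concentrates on the top token. How much this reduces hallucination on your own data is something to measure, not assume.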
Reproducibility
Data Urls
- MS-COCO 2014 (public); simulated data released in repo
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Neither HaELM nor ChatGPT reaches human-level hallucination judgment; both rely on captions as a substitute for perceiving the image.
- Simulated hallucination data cannot fully cover real hallucination patterns, causing recall gaps.
- The paper analyzes triggers but does not fix the root causes of hallucination during model training.
When Not To Use
- When you need human-level, high-stakes verification of image claims.
- If your workflow requires direct multimodal understanding (not caption-based comparison).
Failure Modes
- Evaluator bias: HaELM leans toward "no hallucination" and can miss real hallucinations, while other judges such as ChatGPT flag more readily and produce more false positives.
- Simulation mismatch: generated training hallucinations differ from real model errors.
- Prompt sensitivity: evaluation scores depend on prompt choices and human reference captions.
Core Entities
Models
- HaELM
- ChatGPT
- LLaVA
- MiniGPT-4
- mPLUG-Owl
- LLaMA
- Vicuna
Metrics
- Accuracy
- precision
- recall
- F1
- hallucination ratio
- time and cost
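The classification metrics listed above can be computed from an evaluator's verdicts and human labels as in the sketch below. It assumes that "hallucination" is the positive class and that the hallucination ratio is the share of responses judged hallucinated; the paper's exact definitions may differ, and the function names and sample labels are hypothetical.

```python
def binary_metrics(predicted, actual):
    """Accuracy, precision, recall, and F1 with True meaning 'hallucination'."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def hallucination_ratio(verdicts):
    """Share of responses that the evaluator judged as hallucinated."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(1 for v in verdicts if v) / len(verdicts)


# Illustrative labels only: True = response contains a hallucination.
human_labels = [True, True, False, False, True, False, False, False]
evaluator_verdicts = [True, False, False, False, True, True, False, False]

metrics = binary_metrics(evaluator_verdicts, human_labels)
print({name: round(value, 3) for name, value in metrics.items()})
print("hallucination ratio:", hallucination_ratio(evaluator_verdicts))
```

Reporting precision and recall alongside accuracy makes an evaluator's bias visible: a judge that leans toward "no hallucination" tends to show lower recall on the hallucination class.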
Datasets
- MS-COCO 2014

