Overview
HaELM is a practical, cheaper evaluator that comes close to ChatGPT on the tested data (61% vs 64% accuracy on human labels). It is not human-level and relies on captions as proxies for images, so validate it on a small human-labeled set before full deployment.
Citations: 26
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/7
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.
Summary TLDR
The paper studies hallucinations (false statements not supported by an image) in large vision-language models (LVLMs). It shows that prior object-query tests are biased by prompts and poorly reflect real-world hallucination. The authors build HaELM, an LLM-based evaluator fine-tuned with simulated hallucination examples. HaELM matches ~95% of ChatGPT's performance (61% vs 64% accuracy on human labels), runs locally, costs less, and is reproducible. Using HaELM they quantify how prompts, output length, sampling (top-K) and temperature increase hallucination and give practical tips: prefer concise captions, lower sampling/temperature, and evaluate models like LLaVA that trade less hallucinated
Problem Statement
Vision-language models sometimes state things not visible in the image (hallucinations). Existing object-query tests ("Is there a {object}?") are heavily biased by the prompt and overestimate hallucination. We need a real-world evaluation that understands full free-form descriptions and matches human judgment.
Main Contribution
Showed object-query evaluation is prompt-sensitive and poorly reflects real captions.
Proposed HaELM: an LLM-based hallucination evaluator trained on simulated and real LVLM responses; performs close to ChatGPT while being cheap and local.
Key Findings
Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, yet hallucination in real captions is below 10%.
HaELM reaches ~95% of ChatGPT's performance while being cheaper and running locally.
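The affirmation bias described above can be checked on your own model with a simple yes-rate count over object queries for objects that are known to be absent. The following is a minimal sketch, not the paper's evaluation code; the function name `yes_rate` and the sample replies are hypothetical and only illustrate the bookkeeping.

```python
def yes_rate(answers):
    """Return the fraction of answers that start with 'yes' (case-insensitive)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    hits = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return hits / len(answers)


# Hypothetical replies to "Is there a {object}?" where the object is NOT in the image.
absent_object_answers = [
    "Yes, there is a dog.",
    "yes",
    "Yes, I can see a chair.",
    "No, there is no bicycle.",
    "Yes.",
]

rate = yes_rate(absent_object_answers)
print(f"yes-rate on absent objects: {rate:.0%}")
```

A high yes-rate on absent objects alongside a low hallucination rate in free-form captions would point to prompt-induced bias rather than hallucination in normal use.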
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | HaELM 61% vs ChatGPT 64% | ChatGPT 64% | -3 pp | human-annotated LVLM responses (MS-COCO test) | HaELM reaches ~95% of ChatGPT's level | Table 1 |
| average F1 by LVLM (HaELM) | 88%, 99%, 88% (three LVLMs respectively) | — | — | human-annotated evaluation | Average F1 scores reported for HaELM | Table 2 (text) |
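The "-3 pp" delta and the "~95% of ChatGPT's level" figure in the first row both follow from the two reported accuracies. A small sketch of that arithmetic, with hypothetical helper names:

```python
def delta_pp(value, baseline):
    """Absolute difference in percentage points between two accuracies in [0, 1]."""
    return (value - baseline) * 100


def relative_level(value, baseline):
    """Value expressed as a fraction of the baseline."""
    return value / baseline


haelm_acc = 0.61    # HaELM accuracy on human labels (Table 1)
chatgpt_acc = 0.64  # ChatGPT accuracy on the same labels (Table 1)

print(f"delta: {delta_pp(haelm_acc, chatgpt_acc):+.0f} pp")
print(f"relative level: {relative_level(haelm_acc, chatgpt_acc):.1%}")
```

The ratio 0.61 / 0.64 is about 0.953, which is where the rounded ~95% comes from.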
What To Try In 7 Days
Run HaELM locally on a small set of your images to get baseline hallucination rates.
Compare two LVLMs on your data; prefer models with lower hallucination for safety-sensitive features.
Reduce max output length and lower sampling K/temperature to cut hallucinations in production captions.
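To see why lowering top-K and temperature makes decoding more conservative, it can help to look at the next-token distribution directly. The sketch below is a generic, standard-library illustration of temperature scaling followed by top-K filtering; it is not tied to any particular LVLM or library, and the logits are made-up numbers. It shows only that these settings concentrate probability on the highest-scoring tokens; the link to fewer hallucinations is the paper's empirical finding, not something this code demonstrates.

```python
import math


def next_token_probs(logits, temperature=1.0, top_k=None):
    """Softmax over logits after temperature scaling and optional top-K filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    k = len(logits) if top_k is None else min(top_k, len(logits))
    # Keep the indices of the k highest logits; all other tokens get probability 0.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = {i: logits[i] / temperature for i in keep}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
loose = next_token_probs(logits, temperature=1.5, top_k=None)
tight = next_token_probs(logits, temperature=0.5, top_k=2)
print(f"top-token probability, loose: {loose[0]:.2f}, tight: {tight[0]:.2f}")
```

With the tighter settings the two lowest-scoring tokens can no longer be sampled at all, and the top token takes a larger share of the remaining probability.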
Risks & Boundaries
Limitations
HaELM and ChatGPT do not reach human-level hallucination judgment; captions substitute for image perception.
Simulated hallucination data cannot fully cover real hallucination patterns, causing recall gaps.
When Not To Use
When you need human-level, high-stakes verification of image claims.
If your workflow requires direct multimodal understanding (not caption-based comparison).
Failure Modes
Evaluator bias: HaELM leans toward "no hallucination" while other judges flag more false positives.
Simulation mismatch: generated training hallucinations differ from real model errors.
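One way to surface the evaluator bias noted above is to compare the judge's verdicts with a small human-labeled set and report precision and recall per class, not accuracy alone. This is a minimal sketch under assumed label names ("hallucination" and "no hallucination") and invented example data; `per_class_report` is a hypothetical helper, not part of HaELM.

```python
from collections import Counter


def per_class_report(human, judge, labels=("hallucination", "no hallucination")):
    """Accuracy plus per-class precision and recall of judge labels vs. human labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("human and judge must be non-empty and of equal length")
    pairs = Counter(zip(human, judge))  # (human_label, judge_label) -> count
    report = {"accuracy": sum(pairs[(l, l)] for l in labels) / len(human)}
    for label in labels:
        correct = pairs[(label, label)]
        predicted = sum(pairs[(h, label)] for h in labels)  # judge said `label`
        actual = sum(pairs[(label, j)] for j in labels)     # humans said `label`
        report[label] = {
            "precision": correct / predicted if predicted else 0.0,
            "recall": correct / actual if actual else 0.0,
        }
    return report


# Invented example: a judge that leans toward "no hallucination".
human = ["hallucination"] * 3 + ["no hallucination"] * 3
judge = ["hallucination"] + ["no hallucination"] * 5

report = per_class_report(human, judge)
print(f"accuracy: {report['accuracy']:.2f}")
print(f"hallucination recall: {report['hallucination']['recall']:.2f}")
```

Low recall on the "hallucination" class together with high recall on "no hallucination" is the pattern to look for when a judge under-reports hallucinations.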

