HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

August 29, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

HaELM is a practical and cheaper evaluator that matches ChatGPT closely on the tested data, but it is not human-level and relies on captions as proxies for images, so validate on a small human set before full deployment.

Citations: 26

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, Haoyu Tang


Why It Matters For Business

Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.

Summary TLDR

The paper studies hallucinations (statements not supported by the image) in large vision-language models (LVLMs). It shows that prior object-query tests are biased by their prompts and poorly reflect real-world hallucination. The authors build HaELM, an LLM-based evaluator fine-tuned on simulated hallucination examples. HaELM reaches ~95% of ChatGPT's performance (61% vs 64% accuracy on human labels), runs locally, costs less, and is reproducible. Using HaELM, they quantify how prompts, output length, sampling (top-K), and temperature increase hallucination, and give practical tips: prefer concise captions, lower sampling/temperature, and evaluate models like LLaVA that trade less hallucinated

Problem Statement

Vision-language models sometimes state things not visible in the image (hallucinations). Existing object-query tests ("Is there a {object}?") are heavily biased by the prompt and overestimate hallucination. We need a real-world evaluation that understands full free-form descriptions and matches human judgment.

Main Contribution

Showed object-query evaluation is prompt-sensitive and poorly reflects real captions.

Proposed HaELM: an LLM-based hallucination evaluator trained on simulated and real LVLM responses; performs close to ChatGPT while being cheap and local.

Key Findings

Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, but the hallucination rate in real captions is below 10%.

Numbers: AY >80%; CH <10% (Figure 2, Appendix Tables 9-11)

Practical Use: Do not rely on simple "Is there X?" tests. Use full-description evaluation or human-aligned judges instead.

Evidence Ref: Figure 2; Appendix Tables 9-11
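A minimal sketch of how the "yes" rate on absent-object queries could be measured. The function name `yes_rate` and the sample answers are hypothetical and are not taken from the paper or its code; the string match is a simplification of how a real answer parser would work.

```python
def yes_rate(answers):
    """Fraction of answers that begin with 'yes' (case-insensitive)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    hits = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return hits / len(answers)


# Hypothetical model answers to "Is there a {object}?" prompts
# where the queried object is NOT in the image.
absent_object_answers = [
    "Yes, there is a dog.",
    "yes",
    "No, I do not see one.",
    "Yes, it is on the table.",
    "Yes.",
]

rate = yes_rate(absent_object_answers)
print(f"'yes' rate on absent objects: {rate:.0%}")
```

A high rate on absent objects points to prompt-induced affirmation rather than to hallucination in free-form captions, which is why the two numbers can diverge so sharply.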

HaELM reaches ~95% of ChatGPT's performance while being cheaper and running locally.

Numbers: Accuracy HaELM 61% vs ChatGPT 64% (Table 1)

Practical Use: Use HaELM for repeated or private evaluations to save money and keep data local.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | HaELM 61% vs ChatGPT 64% | ChatGPT 64% | -3 pp | human-annotated LVLM responses (MS-COCO test) | HaELM reaches ~95% of ChatGPT's level | Table 1
Average F1 by LVLM (HaELM) | 88%, 99%, 88% (three LVLMs respectively) | n/a | n/a | human-annotated evaluation | Average F1 scores reported for HaELM | Table 2 (text)
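The delta and the "~95%" figure in the accuracy row follow from simple arithmetic on the two reported accuracies. A short sketch of that calculation, with variable names chosen here for illustration:

```python
haelm_acc = 0.61    # HaELM accuracy on human labels (Table 1)
chatgpt_acc = 0.64  # ChatGPT accuracy on the same labels (Table 1)

# Absolute gap in percentage points and HaELM's accuracy relative to ChatGPT.
delta_pp = (haelm_acc - chatgpt_acc) * 100
relative = haelm_acc / chatgpt_acc

print(f"delta: {delta_pp:+.0f} pp")
print(f"relative: {relative:.1%}")
```

The relative figure is a ratio of accuracies, so it says nothing about how far either judge is from human-level agreement.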

What To Try In 7 Days

Run HaELM locally on a small set of your images to get baseline hallucination rates.

Compare two LVLMs on your data; prefer models with lower hallucination for safety-sensitive features.

Reduce max output length and lower sampling K/temperature to cut hallucinations in production captions.
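To see why lowering K and temperature makes generation more conservative, the sketch below computes next-token sampling probabilities under top-k filtering and temperature scaling. It is a generic illustration of these decoding settings, not the paper's code; the function name and the logits are hypothetical, and it shows only how the sampling distribution changes, not hallucination itself.

```python
import math


def top_k_temperature_probs(logits, k, temperature):
    """Sampling probabilities after top-k filtering and temperature scaling."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    # Keep the indices of the k highest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = {i: logits[i] / temperature for i in keep}
    # Numerically stable softmax over the kept tokens only.
    top = max(scaled.values())
    exps = {i: math.exp(v - top) for i, v in scaled.items()}
    total = sum(exps.values())
    probs = [0.0] * len(logits)
    for i, e in exps.items():
        probs[i] = e / total
    return probs


# Hypothetical next-token logits; index 0 is the most likely token.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
loose = top_k_temperature_probs(logits, k=5, temperature=1.0)
tight = top_k_temperature_probs(logits, k=2, temperature=0.5)
print(f"top-token probability, k=5, T=1.0: {loose[0]:.2f}")
print(f"top-token probability, k=2, T=0.5: {tight[0]:.2f}")
```

With the tighter settings, more probability mass sits on the most likely token and low-ranked tokens cannot be sampled at all, which fits the tip above about lowering K and temperature.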

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

MS-COCO 2014 (public); simulated data released in repo

Risks & Boundaries

Limitations

HaELM and ChatGPT do not reach human-level hallucination judgment; captions substitute for image perception.

Simulated hallucination data cannot fully cover real hallucination patterns, causing recall gaps.

When Not To Use

When you need human-level, high-stakes verification of image claims.

If your workflow requires direct multimodal understanding (not caption-based comparison).

Failure Modes

Evaluator bias: HaELM leans toward "no hallucination" while other judges flag more false positives.

Simulation mismatch: generated training hallucinations differ from real model errors.

Core Entities

Models

HaELM, ChatGPT, LLaVA, MiniGPT-4, mPLUG-Owl, LLaMA, Vicuna

Metrics

Accuracy, precision, recall, F1, hallucination ratio, time and cost
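A minimal sketch of how the classification metrics could be computed when checking an evaluator's verdicts against a small human-labeled set. It assumes binary labels with "hallucination" as the positive class; the function name, the sample labels, and the `flagged_ratio` definition (share of responses the judge flags) are assumptions made here, not definitions taken from the paper.

```python
def binary_metrics(predicted, reference):
    """Accuracy, precision, recall and F1 with True = 'hallucination'."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumption: share of responses the judge flags as hallucinated.
        "flagged_ratio": (tp + fp) / len(pairs),
    }


# Hypothetical labels: True means the response contains a hallucination.
human = [True, True, False, False, False, True]
judge = [True, False, False, False, True, True]

for name, value in binary_metrics(judge, human).items():
    print(f"{name}: {value:.2f}")
```

Reporting precision and recall separately matters here because a judge that leans toward "no hallucination" can keep accuracy reasonable while missing many true hallucinations.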

Datasets

MS-COCO 2014