HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

Overview

Decision Snapshot: Needs Validation

HaELM is a practical and cheaper evaluator that matches ChatGPT closely on the tested data, but it is not human-level and relies on captions as proxies for images, so validate on a small human set before full deployment.

Citations: 26

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, Haoyu Tang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The paper studies hallucinations (false statements not supported by an image) in large vision-language models (LVLMs). It shows that prior object-query tests are biased by prompts and poorly reflect real-world hallucination. The authors build HaELM, an LLM-based evaluator fine-tuned with simulated hallucination examples. HaELM matches ~95% of ChatGPT's performance (61% vs 64% accuracy on human labels), runs locally, costs less, and is reproducible. Using HaELM they quantify how prompts, output length, sampling (top-K) and temperature increase hallucination and give practical tips: prefer concise captions, lower sampling/temperature, and evaluate models like LLaVA that trade less hallucinated

Problem Statement

Vision-language models sometimes state things not visible in the image (hallucinations). Existing object-query tests ("Is there a {object}?") are heavily biased by the prompt and overestimate hallucination. We need a real-world evaluation that understands full free-form descriptions and matches human judgment.

Main Contribution

Showed object-query evaluation is prompt-sensitive and poorly reflects real captions.

Proposed HaELM: an LLM-based hallucination evaluator trained on simulated and real LVLM responses; performs close to ChatGPT while being cheap and local.

Key Findings

Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, but hallucination in real captions is below 10%.

Numbers: "yes" rate on absent objects (AY) >80%; caption hallucination (CH) <10% (Figure 2, Appendix Tables 9-11)

Practical Use: Do not rely on simple "Is there X?" tests. Use full-description evaluation or human-aligned judges instead.

Evidence Ref: Figure 2; Appendix Tables 9-11
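The affirmation-bias check above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the `yes_rate` helper, the example objects, and the example replies are all hypothetical, and only the query template "Is there a {object}?" comes from the text.

```python
# Query template quoted in the Problem Statement.
QUERY_TEMPLATE = "Is there a {object}?"


def yes_rate(answers):
    """Return the fraction of answers that begin with 'yes' (case-insensitive)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    hits = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return hits / len(answers)


# Hypothetical objects that are NOT present in the image.
absent_objects = ["dog", "cup", "bicycle", "clock", "bench"]
queries = [QUERY_TEMPLATE.format(object=name) for name in absent_objects]

# Hypothetical model replies to those queries (made up for illustration).
absent_object_answers = [
    "Yes, there is a dog.",
    "Yes, there is a cup on the table.",
    "No, I do not see one.",
    "yes, on the left",
    "Yes.",
]

rate = yes_rate(absent_object_answers)
print(f"'yes' rate on absent objects: {rate:.0%}")
```

A high rate on objects known to be absent points to prompt-induced bias rather than to how often the model hallucinates in free-form captions.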

HaELM reaches ~95% of ChatGPT's accuracy while being cheaper and running locally.

Numbers: Accuracy HaELM 61% vs ChatGPT 64% (Table 1)

Practical Use: Use HaELM for repeated or private evaluations to save money and keep data local.

Evidence Ref: Table 1
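The "~95%" figure is the ratio of the two accuracies reported in Table 1, not an accuracy in its own right. A short worked calculation, using only the 61% and 64% values from the text (the function name is illustrative):

```python
def relative_performance(score, baseline):
    """Return score as a fraction of baseline, plus the gap in percentage points."""
    ratio = score / baseline
    delta_pp = (score - baseline) * 100
    return ratio, delta_pp


# Accuracies on human-annotated responses, as reported in the text.
haelm_acc = 0.61
chatgpt_acc = 0.64

ratio, delta_pp = relative_performance(haelm_acc, chatgpt_acc)
print(f"HaELM / ChatGPT: {ratio:.1%}, gap: {delta_pp:+.1f} pp")
```

Both evaluators are well below perfect agreement with human labels, which is why the ratio alone should not be read as a measure of absolute quality.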

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	HaELM 61%	ChatGPT 64%	-3 pp	human-annotated LVLM responses (MS-COCO test)	HaELM reaches ~95% of ChatGPT's level	Table 1
Average F1 by LVLM (HaELM)	88%, 99%, 88% (three LVLMs respectively)	n/a	n/a	human-annotated evaluation	Average F1 scores reported for HaELM	Table 2 (text)
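For readers who want to reproduce this kind of comparison on their own human-labeled set, the metrics in the table can be computed as sketched below. This is an illustration under stated assumptions: "hallucination" is treated as the positive class, and the labels shown are invented toy data, not the paper's annotations.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score_judge(human, judge):
    """Compare judge verdicts with human labels (True = hallucination, assumed positive)."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Scores(accuracy, precision, recall, f1)


# Toy data: human annotations vs. an automatic judge's verdicts.
human_labels = [True, True, False, False, True, False, False, False]
judge_labels = [True, False, False, False, True, True, False, False]

scores = score_judge(human_labels, judge_labels)
print(scores)
```

Reporting precision and recall alongside accuracy matters here because a judge that leans toward "no hallucination" can look accurate on an imbalanced set while missing many real hallucinations.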

What To Try In 7 Days

Run HaELM locally on a small set of your images to get baseline hallucination rates.

Compare two LVLMs on your data; prefer models with lower hallucination for safety-sensitive features.

Reduce max output length and lower sampling K/temperature to cut hallucinations in production captions.
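Once a judge has produced per-response verdicts, comparing models (or decoding settings) reduces to comparing hallucination rates. The sketch below assumes the verdicts are already available as booleans; the model names and verdict lists are hypothetical placeholders, and no HaELM API is implied.

```python
def hallucination_rate(verdicts):
    """Return the fraction of responses the judge flagged as hallucinated."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(1 for v in verdicts if v) / len(verdicts)


# Hypothetical judge verdicts on the same ten images (True = hallucination flagged).
verdicts_by_model = {
    "model_a": [False, True, False, False, False, False, True, False, False, False],
    "model_b": [True, True, False, True, False, False, True, False, False, False],
}

rates = {name: hallucination_rate(v) for name, v in verdicts_by_model.items()}
best = min(rates, key=rates.get)

for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name}: {rate:.0%} hallucinated")
print(f"lowest hallucination rate: {best}")
```

The same helper can compare one model under different output lengths or sampling settings by keying the dictionary on the setting instead of the model name. With samples this small the difference is only indicative, so a larger set is needed before choosing a model.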

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/junyangwang0410/HaELM

Data URLs

MS-COCO 2014 (public); simulated data released in repo

Risks & Boundaries

Limitations

HaELM and ChatGPT do not reach human-level hallucination judgment; captions substitute for image perception.

Simulated hallucination data cannot fully cover real hallucination patterns, causing recall gaps.

When Not To Use

When you need human-level, high-stakes verification of image claims.

If your workflow requires direct multimodal understanding (not caption-based comparison).

Failure Modes

Evaluator bias: HaELM leans toward "no hallucination" while other judges flag more false positives.

Simulation mismatch: generated training hallucinations differ from real model errors.

Core Entities

Models

HaELM, ChatGPT, LLaVA, MiniGPT-4, mPLUG-Owl, LLaMA, Vicuna

Metrics

Accuracy, precision, recall, F1, hallucination ratio, time and cost

Datasets

MS-COCO 2014
