HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

August 29, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

26

Authors

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, Haoyu Tang

Why It Matters For Business

Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.

Summary TLDR

The paper studies hallucinations (statements not supported by the image) in large vision-language models (LVLMs). It shows that prior object-query tests are biased by their prompts and poorly reflect real-world hallucination. The authors build HaELM, an LLM-based evaluator fine-tuned on simulated hallucination examples. HaELM reaches about 95% of ChatGPT's performance (61% vs 64% accuracy on human labels), runs locally, costs less, and is reproducible. Using HaELM, they quantify how prompts, output length, top-K sampling, and temperature increase hallucination, and give practical tips: prefer concise captions, lower top-K and temperature, and evaluate models like LLaVA that trade less hallucinated

Problem Statement

Vision-language models sometimes state things not visible in the image (hallucinations). Existing object-query tests ("Is there a {object}?") are heavily biased by the prompt and overestimate hallucination. We need a real-world evaluation that understands full free-form descriptions and matches human judgment.

Main Contribution

Showed object-query evaluation is prompt-sensitive and poorly reflects real captions.

Proposed HaELM: an LLM-based hallucination evaluator trained on simulated and real LVLM responses; performs close to ChatGPT while being cheap and local.

Measured hallucination across open LVLMs and analyzed drivers (prompts, length, sampling, temperature) with practical mitigation suggestions.

Key Findings

Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, yet hallucination in real captions is below 10%.

Numbers: AY >80%; CH <10% (Figure 2, Appendix Tables 9-11)
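
The affirmation bias described above can be measured with a few lines of code. The following is a minimal sketch, not the paper's implementation: `ask_model` is a hypothetical callable standing in for whichever LVLM you query, the "starts with yes" rule is an assumed way to parse answers, and the prompt template is the object-query form quoted in the Problem Statement.

```python
from typing import Callable, Iterable

# Object-query form quoted in the Problem Statement.
PROMPT_TEMPLATE = "Is there a {object}?"


def yes_rate_on_absent(
    ask_model: Callable[[str, str], str],
    image_id: str,
    absent_objects: Iterable[str],
) -> float:
    """Fraction of absent objects that the model claims are present.

    `ask_model(image_id, prompt)` is a hypothetical function returning the
    model's free-text answer. An answer counts as "yes" if it starts with
    that word (case-insensitive), which is an assumed parsing rule.
    """
    objects = list(absent_objects)
    if not objects:
        raise ValueError("need at least one absent object")
    yes_count = 0
    for obj in objects:
        answer = ask_model(image_id, PROMPT_TEMPLATE.format(object=obj))
        if answer.strip().lower().startswith("yes"):
            yes_count += 1
    return yes_count / len(objects)
```

A rate near 1.0 on objects known to be absent points to affirmation bias in the query format rather than to hallucination in free-form captions.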

HaELM reaches about 95% of ChatGPT's accuracy (61/64 ≈ 0.95) while being cheaper and running locally.

Numbers: Accuracy HaELM 61% vs ChatGPT 64% (Table 1)

Hallucination rates differ widely across LVLMs: LLaVA ~19.4%, MiniGPT-4 ~55.0%, mPLUG-Owl ~36.2% on MS-COCO prompts.

Numbers: Avg hallucination: LLaVA 19.4%; MiniGPT-4 55.0%; mPLUG-Owl 36.2% (Table 4)

Longer outputs, higher top-K sampling and higher temperature all raise hallucination.

Numbers: Length 128→1024: 33.1%→37% (Table 5); K=1→5: 24.7%→42.4% (Table 6); temp 0.2→1: 24.7%→35.9% (Table 12)

HaELM has different bias than ChatGPT: it leans toward "no hallucination" while ChatGPT tends to flag hallucinations.

Numbers: HaELM's precision/recall patterns favor the non-hallucination class relative to ChatGPT (Tables 1-2)
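
To see what "leaning toward no hallucination" looks like in precision and recall, it can help to compute the metrics per class. The sketch below is illustrative only: the function names are hypothetical, and the labels in the usage note are toy data, not figures from the paper.

```python
from typing import Dict, Sequence


def class_metrics(
    human: Sequence[bool], judge: Sequence[bool], positive: bool
) -> Dict[str, float]:
    """Precision, recall and F1 for one class.

    Labels are booleans: True = "hallucination", False = "no hallucination".
    `positive` selects which of the two classes is scored.
    """
    if len(human) != len(judge):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Share of items where the judge agrees with the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)
```

For example, with toy human labels `[True, True, True, False, False, False]` and a lenient judge that outputs `[True, False, False, False, False, False]`, recall on the "no hallucination" class is perfect while recall on the "hallucination" class drops to one third. An over-flagging judge would show the opposite pattern.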

Results

Accuracy

Value: HaELM 61% vs ChatGPT 64%

Baseline: ChatGPT 64%

average F1 by LVLM (HaELM)

Value: 88%, 99%, 88% (three LVLMs respectively)

hallucination ratio by model

Value: LLaVA 19.4%; MiniGPT-4 55.0%; mPLUG-Owl 36.2%

effect of max generation length

Value: 33.1% (128) → 37% (1024)

Baseline: 128 tokens

effect of top-K sampling

Value: K=1: 24.7% → K=5: 42.4%

Baseline: K=1

effect of temperature

Value: temp 0.2: 24.7% → temp 1.0: 35.9%

Baseline: temp 0.2

time & monetary cost (one-time setup + per-eval)

Value: Collection 1.8h/$4.3 + Training 2h; Per evaluation 0.2h

Baseline: ChatGPT per-eval 1.6h/$6.6
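
The time figures above imply a break-even point after which HaELM's one-time setup has paid for itself. The sketch below works through that arithmetic using only the numbers reported here. It assumes, since the summary lists no per-evaluation dollar cost for HaELM, that HaELM's only monetary cost is the $4.3 data collection; hardware and electricity are not counted.

```python
import math
from typing import Tuple

# Figures reported in the Results section above.
HAELM_SETUP_HOURS = 1.8 + 2.0   # data collection + training (one-time)
HAELM_SETUP_USD = 4.3           # data collection (one-time)
HAELM_EVAL_HOURS = 0.2          # per evaluation run
CHATGPT_EVAL_HOURS = 1.6        # per evaluation run
CHATGPT_EVAL_USD = 6.6          # per evaluation run


def cumulative_hours(n_evals: int) -> Tuple[float, float]:
    """Total hours after n evaluation runs as (HaELM, ChatGPT)."""
    haelm = HAELM_SETUP_HOURS + HAELM_EVAL_HOURS * n_evals
    chatgpt = CHATGPT_EVAL_HOURS * n_evals
    return haelm, chatgpt


def cumulative_usd(n_evals: int) -> Tuple[float, float]:
    """Total dollars after n runs as (HaELM, ChatGPT).

    Assumption: HaELM has no per-run dollar cost beyond setup.
    """
    return HAELM_SETUP_USD, CHATGPT_EVAL_USD * n_evals


def break_even_evals() -> int:
    """Smallest number of runs at which HaELM's total time is strictly lower."""
    saved_per_eval = CHATGPT_EVAL_HOURS - HAELM_EVAL_HOURS
    return math.floor(HAELM_SETUP_HOURS / saved_per_eval) + 1
```

Under these assumptions HaELM is already cheaper in dollars at the first run ($4.3 vs $6.6) and becomes faster in total time from the third run onward (4.4h vs 4.8h).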

Who Should Care

What To Try In 7 Days

Run HaELM locally on a small set of your images to get baseline hallucination rates.

Compare two LVLMs on your data; prefer models with lower hallucination for safety-sensitive features.

Reduce max output length and lower sampling K/temperature to cut hallucinations in production captions.
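
The three steps above can be combined into a small sweep over decoding settings. This is a hedged sketch rather than HaELM's actual interface: `generate` and `is_hallucinated` are hypothetical callables that you would back with your own LVLM and with a hallucination judge such as HaELM. The default grid values are the settings compared in the Results section.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Sequence, Tuple

Setting = Tuple[int, int, float]  # (max_new_tokens, top_k, temperature)


def sweep_decoding(
    images: Sequence[str],
    generate: Callable[[str, int, int, float], str],
    is_hallucinated: Callable[[str, str], bool],
    max_lengths: Iterable[int] = (128, 1024),
    top_ks: Iterable[int] = (1, 5),
    temperatures: Iterable[float] = (0.2, 1.0),
) -> Dict[Setting, float]:
    """Hallucination rate for every combination of decoding settings.

    `generate(image, max_new_tokens, top_k, temperature)` returns a caption.
    `is_hallucinated(image, caption)` returns the judge's verdict.
    Both are hypothetical stand-ins supplied by the caller.
    """
    if not images:
        raise ValueError("need at least one image")
    rates: Dict[Setting, float] = {}
    for setting in product(max_lengths, top_ks, temperatures):
        max_new_tokens, top_k, temperature = setting
        flagged = 0
        for image in images:
            caption = generate(image, max_new_tokens, top_k, temperature)
            flagged += bool(is_hallucinated(image, caption))
        rates[setting] = flagged / len(images)
    return rates
```

Sorting the returned dictionary by rate gives a quick view of which settings to prefer on your own data. On a small image set the rates will be noisy, so treat small differences with caution.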

Reproducibility

Data Urls

  • MS-COCO 2014 (public); simulated data released in repo

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • HaELM and ChatGPT do not reach human-level hallucination judgment; both judge against reference captions, which substitute for direct image perception.
  • Simulated hallucination data cannot fully cover real hallucination patterns, causing recall gaps.
  • The paper analyzes triggers but does not fix the root causes of hallucination during model training.

When Not To Use

  • When you need human-level, high-stakes verification of image claims.
  • If your workflow requires direct multimodal understanding (not caption-based comparison).

Failure Modes

  • Evaluator bias: HaELM leans toward "no hallucination" and can miss real hallucinations, while ChatGPT tends to over-flag them.
  • Simulation mismatch: generated training hallucinations differ from real model errors.
  • Prompt sensitivity: evaluation scores depend on prompt choices and human reference captions.

Core Entities

Models

  • HaELM
  • ChatGPT
  • LLaVA
  • MiniGPT-4
  • mPLUG-Owl
  • LLaMA
  • Vicuna

Metrics

  • Accuracy
  • precision
  • recall
  • F1
  • hallucination ratio
  • time and cost

Datasets

  • MS-COCO 2014