Med-HALT: a public benchmark that tests LLM hallucinations on medical multiple-choice and PubMed retrieval tasks

July 28, 2023 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.55

Cost Impact Score

0.3

Citation Count

17

Authors

Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu

Links

Abstract / PDF

Why It Matters For Business

If you plan to use LLMs for medical content or literature retrieval, expect frequent confident errors unless you add external retrieval, verification, or human oversight; Med‑HALT lets you measure that risk quantitatively.

Summary TLDR

Med-HALT is a new, openly shared benchmark for measuring hallucination in LLMs on medical tasks. It combines ~18,866 reasoning (multiple-choice) samples from multinational medical exams and 4,916 PubMed retrieval samples. The suite includes reasoning tests (False Confidence, None‑of‑the‑Above, Fake Questions) and memory/retrieval tests (PMID↔title, title↔link, abstract↔link, link↔title). Evaluations of GPT-3.5, Text‑Davinci, Llama‑2, Falcon and MPT show large variation: open models (Falcon, Llama‑2) often outperform commercial models on these tasks, but no model comes close to clinically safe accuracy. The benchmark also measures parsing/format errors and probes sensitivity to prompt, temperature, and instruction tuning.

Problem Statement

Large language models can produce confident but incorrect medical statements (hallucinations). There was no focused, public benchmark that measures reasoning and memory hallucinations on realistic, multilingual medical exam questions and PubMed retrieval tasks.

Main Contribution

Med-HALT dataset: ~18,866 reasoning MCQs from multinational medical exams and 4,916 PubMed IR samples.

A test suite split into Reasoning Hallucination Tests (False Confidence, NOTA, Fake Questions) and Memory Hallucination Tests (PMID/title/link/abstract retrieval).
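As a rough illustration of how a None‑of‑the‑Above (NOTA) item could be derived from an ordinary MCQ, the sketch below replaces the text of the correct option with "None of the above", so the right answer becomes the NOTA choice. The `MCQ` dataclass, its field names, and the `make_nota_item` helper are hypothetical and are not taken from the Med-HALT code; the benchmark's actual construction may differ.

```python
from dataclasses import dataclass, replace
from typing import Tuple

NOTA_TEXT = "None of the above"


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: Tuple[str, ...]
    correct_index: int


def make_nota_item(item: MCQ) -> MCQ:
    """Return a copy of `item` whose correct option text is swapped for NOTA_TEXT.

    The index of the correct answer is unchanged; only its text changes, so a
    model must recognise that none of the remaining concrete options is right.
    """
    if not 0 <= item.correct_index < len(item.options):
        raise ValueError("correct_index is out of range")
    new_options = tuple(
        NOTA_TEXT if i == item.correct_index else text
        for i, text in enumerate(item.options)
    )
    return replace(item, options=new_options)
```

The original item is left untouched, which makes it easy to evaluate the same question in both its plain and NOTA forms.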

Baseline evaluation of commercial and open LLMs (Text‑Davinci, GPT‑3.5, Llama‑2, Falcon, MPT) using accuracy, pointwise exam score, parsing failure rates, and sensitivity studies.

Open release of benchmark and evaluation design (medhalt.github.io) to support reproducible research.

Key Findings

No model achieved clinical-grade accuracy on reasoning hallucination tests.

Numbers: Llama‑2 70B Reasoning FCT accuracy 42.21% (Table 2)

Models vary widely by test type; some excel at detecting fake questions but fail on reasoning.

Numbers: Falcon 40B Reasoning Fake accuracy 99.89%, but Reasoning FCT accuracy 18.66% (Table 2)

Memory/retrieval accuracy is low but open models perform better on average.

Numbers: Falcon 40B IR average accuracy 30.36% vs GPT‑3.5 19.96% (Table 3)
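One way to score a memory test such as PMID→title is a normalized exact match against a gold lookup, counting missing or mismatched titles as errors. The sketch below is only an assumption about how such scoring could work; `normalize_title`, `retrieval_accuracy`, and the placeholder IDs are hypothetical, and the paper's own matching rule may be stricter or looser.

```python
import re
import unicodedata
from typing import Dict


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def retrieval_accuracy(gold: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Fraction of gold IDs whose predicted title matches after normalization.

    IDs with no prediction are counted as wrong.
    """
    if not gold:
        raise ValueError("gold must not be empty")
    hits = sum(
        1
        for pmid, title in gold.items()
        if pmid in predicted
        and normalize_title(predicted[pmid]) == normalize_title(title)
    )
    return hits / len(gold)


# Placeholder IDs and titles, made up for illustration only.
gold = {"id-1": "Aspirin for Primary Prevention.", "id-2": "Statins and Stroke"}
predicted = {"id-1": "aspirin for primary prevention", "id-2": "Statins and Heart Failure"}
score = retrieval_accuracy(gold, predicted)
```

Exact match after normalization is deliberately conservative: a plausible-sounding but different title is treated as a hallucination rather than a near miss.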

Few‑shot examples help, but gains plateau after a few exemplars.

Numbers: GPT‑3.5 zero‑shot accuracy ~7.31%; accuracy improves with shots but plateaus after ~3 examples (Section 7.2)
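To compare zero-shot and few-shot behaviour on your own model, a small prompt builder that takes the number of exemplars `k` as a parameter is usually enough. The sketch below is a minimal, hypothetical version; the prompt layout, the `Exemplar` tuple, and the function names are assumptions and do not reproduce the prompts used in the paper.

```python
from typing import Sequence, Tuple

# (question, options, answer letter); a hypothetical exemplar format.
Exemplar = Tuple[str, Sequence[str], str]


def format_question(question: str, options: Sequence[str]) -> str:
    """Render a question with lettered options (A, B, C, ...)."""
    lines = [f"Question: {question}"]
    for i, option in enumerate(options):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    return "\n".join(lines)


def build_prompt(
    exemplars: Sequence[Exemplar],
    k: int,
    question: str,
    options: Sequence[str],
) -> str:
    """Build a k-shot prompt: the first k exemplars, then the target question."""
    if k < 0:
        raise ValueError("k must be non-negative")
    blocks = [
        format_question(q, opts) + f"\nAnswer: {answer}"
        for q, opts, answer in exemplars[:k]
    ]
    blocks.append(format_question(question, options) + "\nAnswer:")
    return "\n\n".join(blocks)
```

Running the same evaluation set with `k` = 0, 1, 2, 3, ... and plotting accuracy against `k` shows where gains level off for a given model.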

Instruction tuning and RLHF can worsen hallucination control for some models.

Numbers: Authors report instruction tuning had a more harmful effect on Llama models; effect smaller on OpenAI and Falcon (Sec 6.

Format‑parsing failures are a measurable reliability signal.

Numbers: Parsing error rates: GPT‑3.5 1–3% across tasks; some chat variants show high parsing failure, e.g., Llama‑2‑70B‑chat 41.1

Results

Accuracy

Value: Llama-2 70B: 42.21%

Accuracy

Value: Falcon 40B: 99.89%

Accuracy

Value: Llama-2 70B: 77.53%

Accuracy

Value: Falcon 40B: 30.36%

Accuracy

Value: Falcon 40B: 42.46% (reported average across tasks)

Zero-shot baseline (GPT-3.5)

Value: 7.31% accuracy (zero-shot)

Parsing/format exception rates

Value: GPT-3.5 ~1–3%; Llama-2-70B-chat up to 41.1% on some tasks

Who Should Care

What To Try In 7 Days

Run Med‑HALT on your model to get a baseline on reasoning and retrieval errors.

Measure parsing/format error rates and treat malformed outputs as failures.

Add 2–3 high-quality few‑shot examples and compare gains for your use case.
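Treating malformed outputs as failures can be sketched as follows, assuming the model is asked to reply with a JSON object containing an `"answer"` key that holds an option letter. That output schema and the helper names are assumptions made for illustration, not the benchmark's exact format.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_answer(raw: str) -> Optional[str]:
    """Return the upper-cased answer letter, or None if the output is unusable."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    answer = obj.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        return None
    return answer.strip().upper()


def score_outputs(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (accuracy, parse_error_rate) over (raw_output, gold_letter) pairs.

    Unparsable outputs stay in the denominator, so they lower accuracy
    instead of being silently dropped.
    """
    total = correct = unparsable = 0
    for raw, gold in pairs:
        total += 1
        parsed = parse_answer(raw)
        if parsed is None:
            unparsable += 1
        elif parsed == gold.strip().upper():
            correct += 1
    if total == 0:
        raise ValueError("no outputs to score")
    return correct / total, unparsable / total
```

Reporting both numbers side by side separates "the model answered wrongly" from "the model did not follow the format", which call for different fixes.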

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focus limited to multiple-choice reasoning and PubMed retrieval; does not cover free‑text generation or clinical dialogue.
  • GPT‑4 was not evaluated due to budget constraints, so comparisons to the very best closed models are missing.
  • Instruction‑tuning effects are observed but not deeply analyzed for mitigation strategies.
  • Some datasets combine multiple public exam sources which may have sampling biases by country or exam style.

When Not To Use

  • Do not use evaluated models for direct clinical decision making without external verification.
  • Do not assume high performance on non‑MCQ or conversational clinical tasks from these results.
  • Avoid relying solely on instruction tuning to reduce hallucinations; test empirically.

Failure Modes

  • Confident but incorrect answers on reasoning questions (hallucination).
  • Incorrect mapping in PubMed retrieval tasks (false positives or wrong titles/links).
  • Malformed or unparsable JSON outputs breaking downstream pipelines.
  • High sensitivity to prompt wording, temperature, and exemplar choices.

Core Entities

Models

  • Text-Davinci-003
  • GPT-3.5 Turbo
  • Llama-2 70B
  • Llama-2 70B-chat
  • Llama-2 13B
  • Llama-2 13B-chat
  • Llama-2 7B
  • Llama-2 7B-chat
  • Falcon 40B
  • Falcon 40B-instruct
  • MPT-7B
  • MPT-7B-instruct

Metrics

  • Accuracy
  • Pointwise Score (exam-style)
  • Parsing/format error rate
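
The first two metrics can be computed with a few lines of code. In the sketch below, the pointwise score is the mean per-question points under exam-style negative marking; the default reward of +1 and penalty of -0.25 are assumptions chosen for illustration, so check the paper for the exact values before comparing against its tables.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold answer."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def pointwise_score(
    predictions: Sequence[str],
    gold: Sequence[str],
    reward: float = 1.0,     # assumed points for a correct answer
    penalty: float = -0.25,  # assumed points for a wrong answer
) -> float:
    """Mean points per question with negative marking for wrong answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    points = sum(reward if p == g else penalty for p, g in zip(predictions, gold))
    return points / len(gold)


preds = ["A", "B", "C", "D"]
answers = ["A", "B", "C", "A"]
acc = accuracy(preds, answers)            # 3 of 4 correct
score = pointwise_score(preds, answers)   # (3 * 1.0 - 0.25) / 4
```

Unlike plain accuracy, a negatively marked score can drop below zero, which penalises confident guessing.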

Datasets

  • MedMCQA
  • HeadQA
  • MedQA (USMLE)
  • TWMLE (Taiwan)
  • PubMed subset (Med-HALT IR)

Benchmarks

  • Med-HALT (RHT + MHT)