Med-HALT: a public benchmark that tests LLM hallucinations on medical multiple-choice and PubMed retrieval tasks

Overview

Decision Snapshot: Needs Validation

Med‑HALT is a practical, public benchmark that reveals large error rates for current LLMs on medical reasoning and retrieval; it is useful for risk assessment and model selection but does not make models production‑safe by itself.

Citations: 17

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 20%

Novelty: 55%

Authors

Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to use LLMs for medical content or literature retrieval, expect frequent confident errors unless you add external retrieval, verification, or human oversight; Med‑HALT lets you measure that risk quantitatively.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

Med-HALT is a new, openly shared benchmark for measuring hallucination in LLMs on medical tasks. It combines ~18,866 reasoning (multiple-choice) samples from multinational medical exams and 4,916 PubMed retrieval samples. The suite includes reasoning tests (False Confidence, None‑of‑the‑Above, Fake Questions) and memory/retrieval tests (PMID↔title, title↔link, abstract↔link, link↔title). Evaluations of GPT-3.5, Text‑Davinci, Llama‑2, Falcon and MPT show large variation: open models (Falcon, Llama‑2) often outperform commercial models on these tasks but no model is close to safe clinical accuracy. The benchmark also measures parsing/format errors and probes prompt, temperature, instruction‑t‑

Problem Statement

Large language models can produce confident but incorrect medical statements (hallucinations). There was no focused, public benchmark that measures reasoning and memory hallucinations on realistic, multilingual medical exam questions and PubMed retrieval tasks.

Main Contribution

Med-HALT dataset: ~18,866 reasoning MCQs from multinational medical exams and 4,916 PubMed IR samples.

A test suite split into Reasoning Hallucination Tests (False Confidence, NOTA, Fake Questions) and Memory Hallucination Tests (PMID/title/link/abstract retrieval).
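
One plausible way to build a None‑of‑the‑Above (NOTA) item from an ordinary multiple-choice question is to overwrite the correct option with "None of the above", so that the right answer is to reject every remaining distractor. The sketch below assumes that construction and a hypothetical item layout (`question`, `options`, `answer_index`); neither is confirmed by this summary, and the benchmark's own data format may differ.

```python
from copy import deepcopy

NOTA_TEXT = "None of the above"


def make_nota_item(item):
    """Return a copy of an MCQ item whose correct option is replaced by NOTA.

    The dict layout (question / options / answer_index) is a hypothetical
    format chosen for illustration, not the benchmark's schema.
    """
    index = item["answer_index"]
    if not 0 <= index < len(item["options"]):
        raise ValueError("answer_index is out of range for the options list")
    nota = deepcopy(item)  # leave the original item untouched
    nota["options"][index] = NOTA_TEXT
    return nota


original = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    "answer_index": 1,
}
nota_item = make_nota_item(original)
print(nota_item["options"])
```

The gold index stays the same, so the scoring code for ordinary MCQs can be reused without changes.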

Key Findings

No model achieved clinical-grade accuracy on reasoning hallucination tests.

Numbers: Llama‑2 70B Reasoning FCT accuracy 42.21% (Table 2)

Practical Use: Do not use these off‑the‑shelf models for unsupervised clinical decision making; expect frequent incorrect confident answers.

Evidence Ref: Table 2

Models vary widely by test type; some excel at detecting fake questions but fail on reasoning.

Numbers: Falcon 40B Reasoning Fake accuracy 99.89%, but Reasoning FCT accuracy 18.66% (Table 2)

Practical Use: Evaluate models on the exact failure modes you care about (fake vs reasoning). High performance on one task does not generalize.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Llama-2 70B: 42.21%	—	—	Reasoning FCT (Med-HALT)	Table 2 reports 42.21% accuracy for Llama-2 70B on Reasoning FCT	Table 2
Accuracy	Falcon 40B: 99.89%	—	—	Reasoning Fake (Med-HALT)	Table 2 shows Falcon 40B 99.89% accuracy on Reasoning Fake	Table 2

What To Try In 7 Days

Run Med‑HALT on your model to get a baseline on reasoning and retrieval errors.

Measure parsing/format error rates and treat malformed outputs as failures.

Add 2–3 high-quality few‑shot examples and compare gains for your use case.
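
The first two steps above can be sketched as a small scoring routine that reports accuracy and the parsing/format error rate together, counting malformed outputs as wrong. This is a minimal sketch, not the benchmark's evaluation code: it assumes each model response is a JSON object with a hypothetical `answer_index` field, and the helper names are illustrative.

```python
import json
from typing import Optional


def parse_choice(raw: str, num_options: int) -> Optional[int]:
    """Return the chosen option index, or None if the output is malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    choice = payload.get("answer_index")  # hypothetical field name
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(choice, bool) or not isinstance(choice, int):
        return None
    if not 0 <= choice < num_options:
        return None
    return choice


def score_responses(raw_outputs, gold, num_options=4):
    """Compute accuracy and parse-error rate; malformed outputs count as wrong."""
    if len(raw_outputs) != len(gold):
        raise ValueError("raw_outputs and gold must have the same length")
    if not gold:
        raise ValueError("no samples to score")
    correct = 0
    malformed = 0
    for raw, answer in zip(raw_outputs, gold):
        choice = parse_choice(raw, num_options)
        if choice is None:
            malformed += 1
        elif choice == answer:
            correct += 1
    total = len(gold)
    return {"accuracy": correct / total, "parse_error_rate": malformed / total}


outputs = [
    '{"answer_index": 2}',   # well-formed and correct
    "I think it is B",       # not JSON: malformed
    '{"answer_index": 9}',   # index out of range: malformed
    '{"answer_index": 0}',   # well-formed but wrong
]
result = score_responses(outputs, gold=[2, 1, 3, 1])
print(result)
```

Running the same routine with and without few-shot examples in the prompt gives a like-for-like comparison for the third step.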

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://medhalt.github.io

Data URLs

https://medhalt.github.io

Risks & Boundaries

Limitations

Focus limited to multiple-choice reasoning and PubMed retrieval; does not cover free‑text generation or clinical dialogue.

GPT‑4 was not evaluated due to budget constraints, so comparisons to the very best closed models are missing.

When Not To Use

Do not use evaluated models for direct clinical decision making without external verification.

Do not assume high performance on non‑MCQ or conversational clinical tasks from these results.

Failure Modes

Confident but incorrect answers on reasoning questions (hallucination).

Incorrect mapping in PubMed retrieval tasks (false positives or wrong titles/links).
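
For the retrieval failure mode, a predicted title can be compared with the reference title after light normalization, so that differences in casing or punctuation are not counted as hallucinations. The sketch below is one possible check; the normalization rules and the helper names are assumptions, not the benchmark's matching logic.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Casefold, strip punctuation, and collapse whitespace in a title."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


def title_matches(predicted: str, reference: str) -> bool:
    """True if the two titles are equal after normalization."""
    return normalize_title(predicted) == normalize_title(reference)


print(title_matches("Aspirin in Stroke: A Review.", "aspirin in stroke a review"))
print(title_matches("Aspirin in Stroke", "Aspirin in Migraine"))
```

Exact match after normalization is deliberately strict: a title that is merely similar to the reference is still a wrong retrieval.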

Core Entities

Models

Text-Davinci-003, GPT-3.5 Turbo, Llama-2 70B, Llama-2 70B-chat, Llama-2 13B, Llama-2 13B-chat, Llama-2 7B, Llama-2 7B-chat, Falcon 40B, Falcon 40B-instruct, MPT-7B, MPT-7B-instruct

Metrics

Accuracy, Pointwise Score (exam-style), Parsing/format error rate
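
An exam-style pointwise score rewards correct answers and penalizes wrong ones, so confident guessing costs more than it does under plain accuracy. The sketch below assumes a reward of +1 and a penalty of -0.25 per item, averaged over all items; these values and the `pointwise_score` helper are assumptions in the style of negatively marked exams and should be checked against the paper before comparing numbers.

```python
def pointwise_score(predictions, gold, reward=1.0, penalty=-0.25):
    """Average per-item score: `reward` if correct, `penalty` otherwise.

    The default reward and penalty are assumed values, not taken from this
    summary. A prediction of None (for example an unparseable output)
    never equals a gold index, so it is penalized like a wrong answer.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("no samples to score")
    total = sum(reward if p == g else penalty for p, g in zip(predictions, gold))
    return total / len(gold)


# Two correct, one wrong, one malformed: (1 + 1 - 0.25 - 0.25) / 4
print(pointwise_score([0, 1, 2, None], [0, 1, 0, 3]))
```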

Datasets

MedMCQA, HeadQA, MedQA (USMLE), TWMLE (Taiwan), PubMed subset (Med-HALT IR)

Benchmarks

Med-HALT (RHT + MHT)
