Overview
Med‑HALT is a practical, public benchmark that reveals large error rates for current LLMs on medical reasoning and retrieval; it is useful for risk assessment and model selection but does not make models production‑safe by itself.
Citations: 17
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 20%
Novelty: 55%
Why It Matters For Business
If you plan to use LLMs for medical content or literature retrieval, expect frequent confident errors unless you add external retrieval, verification, or human oversight; Med‑HALT lets you measure that risk quantitatively.
Summary TLDR
Med-HALT is a new, openly shared benchmark for measuring hallucination in LLMs on medical tasks. It combines ~18,866 reasoning (multiple-choice) samples from multinational medical exams and 4,916 PubMed retrieval samples. The suite includes reasoning tests (False Confidence, None‑of‑the‑Above, Fake Questions) and memory/retrieval tests (PMID↔title, title↔link, abstract↔link, link↔title). Evaluations of GPT-3.5, Text‑Davinci, Llama‑2, Falcon and MPT show large variation: open models (Falcon, Llama‑2) often outperform commercial models on these tasks but no model is close to safe clinical accuracy. The benchmark also measures parsing/format errors and probes prompt, temperature, instruction‑t‑
Problem Statement
Large language models can produce confident but incorrect medical statements (hallucinations). There was no focused, public benchmark that measures reasoning and memory hallucinations on realistic, multilingual medical exam questions and PubMed retrieval tasks.
Main Contribution
Med-HALT dataset: ~18,866 reasoning MCQs from multinational medical exams and 4,916 PubMed IR samples.
A test suite split into Reasoning Hallucination Tests (False Confidence, NOTA, Fake Questions) and Memory Hallucination Tests (PMID/title/link/abstract retrieval).
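As an illustration of how a None-of-the-Above (NOTA) item might be derived from an ordinary MCQ, here is a minimal sketch. It assumes that a NOTA item is formed by replacing the text of the correct option with "None of the above"; this summary does not spell out the construction, so treat that as an assumption. The item schema (`question`, `options`, `answer`) and the function name `make_nota_item` are hypothetical and are not taken from the Med-HALT code.

```python
import copy

NOTA_TEXT = "None of the above"


def make_nota_item(item: dict) -> dict:
    """Return a copy of an MCQ item whose correct option text is replaced by NOTA.

    Assumed (hypothetical) schema:
        {"question": str, "options": {"A": str, ...}, "answer": "A"}
    The answer key stays the same, so the NOTA option becomes the correct choice.
    """
    if item["answer"] not in item["options"]:
        raise ValueError("answer key is not one of the option keys")
    nota_item = copy.deepcopy(item)  # leave the original item untouched
    nota_item["options"][item["answer"]] = NOTA_TEXT
    return nota_item
```

A model that still picks one of the remaining distractors on such an item is answering confidently when no listed option is correct, which is the behaviour this kind of test is meant to expose.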
Key Findings
No model achieved clinical-grade accuracy on reasoning hallucination tests.
Models vary widely by test type; some excel at detecting fake questions but fail on reasoning.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Llama-2 70B: 42.21% | — | — | Reasoning FCT (Med-HALT) | Table 2 reports 42.21% accuracy for Llama-2 70B on Reasoning FCT | Table 2 |
| Accuracy | Falcon 40B: 99.89% | — | — | Reasoning Fake (Med-HALT) | Table 2 shows Falcon 40B 99.89% accuracy on Reasoning Fake | Table 2 |
What To Try In 7 Days
Run Med‑HALT on your model to get a baseline on reasoning and retrieval errors.
Measure parsing/format error rates and treat malformed outputs as failures.
Add 2–3 high-quality few‑shot examples and compare gains for your use case.
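The second step above can be sketched as follows: score a batch of model outputs and count any output that cannot be parsed as a failure. This is a sketch under stated assumptions, not the benchmark's own scorer. It assumes the model is asked to reply with a JSON object containing an `answer` key holding an option letter; that output format, the `Score` class, and the function names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class Score:
    total: int
    correct: int
    malformed: int

    @property
    def accuracy(self) -> float:
        # Malformed outputs stay in the denominator, so they count as failures.
        return self.correct / self.total if self.total else 0.0

    @property
    def format_error_rate(self) -> float:
        return self.malformed / self.total if self.total else 0.0


def parse_choice(raw: str, valid: Iterable[str]) -> Optional[str]:
    """Return the chosen option key, or None if the output is malformed."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    choice = obj.get("answer")  # "answer" is an assumed key name
    if isinstance(choice, str) and choice in set(valid):
        return choice
    return None


def score(outputs: Sequence[str], gold: Sequence[str],
          valid: Iterable[str] = ("A", "B", "C", "D")) -> Score:
    """Compare raw model outputs against gold option keys."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    valid = tuple(valid)
    correct = malformed = 0
    for raw, expected in zip(outputs, gold):
        choice = parse_choice(raw, valid)
        if choice is None:
            malformed += 1
        elif choice == expected:
            correct += 1
    return Score(total=len(gold), correct=correct, malformed=malformed)
```

Running the same `score` call on zero-shot and few-shot outputs gives a like-for-like comparison for the third step. Reporting `format_error_rate` alongside `accuracy` separates "the model answered wrongly" from "the model did not produce a usable answer".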
Risks & Boundaries
Limitations
Focus limited to multiple-choice reasoning and PubMed retrieval; does not cover free‑text generation or clinical dialogue.
GPT‑4 was not evaluated due to budget constraints, so comparisons to the very best closed models are missing.
When Not To Use
Do not use evaluated models for direct clinical decision making without external verification.
Do not assume high performance on non‑MCQ or conversational clinical tasks from these results.
Failure Modes
Confident but incorrect answers on reasoning questions (hallucination).
Incorrect mapping in PubMed retrieval tasks (false positives or wrong titles/links).
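To make the second failure mode concrete, the sketch below checks PMID-to-title predictions against a reference mapping. It is one possible way to flag wrong mappings, not the benchmark's scoring method: the exact-match-after-normalization rule is an assumption, and the function names are hypothetical.

```python
import re
import unicodedata
from typing import Dict, List


def normalize_title(title: str) -> str:
    """Casefold, drop punctuation, and collapse whitespace before comparing."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def wrong_mappings(predictions: Dict[str, str],
                   reference: Dict[str, str]) -> List[str]:
    """Return the PMIDs whose predicted title is missing or does not match.

    Both arguments map a PMID string to a title string. A PMID with no
    prediction counts as an error.
    """
    wrong = []
    for pmid, ref_title in reference.items():
        predicted = predictions.get(pmid)
        if predicted is None or normalize_title(predicted) != normalize_title(ref_title):
            wrong.append(pmid)
    return wrong
```

Exact matching after normalization is deliberately strict: a plausible-looking but different title is counted as wrong, which is the case that matters when a model fabricates a citation. A fuzzy similarity threshold would be more forgiving of small transcription differences but could let near-miss fabrications through.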

