MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

September 11, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

6

Authors

Praveenkumar Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Prateek Munjal, Nada Saadi, Hamza A Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan

Why It Matters For Business

MEDIC offers practical, fast offline checks of clinical readiness: it flags operational failures and hallucinations that standard exam-style benchmarks miss, reducing deployment risk before costly pilots.

Summary TLDR

MEDIC is a modular evaluation framework for clinical LLMs that stresses operational tasks, safety auditing, and reference-free factuality checks. It adds a Cross-Examination Framework (CEF) to verify summaries without human references, and runs deterministic execution tests (e.g., SQL, clinical calculations). Key findings: (1) static medical knowledge benchmarks (USMLE-style) are saturated and do not predict success on precise operational tasks (MedCalc, EHRSQL); (2) 'passive' safety (refusing harmful requests) is not the same as 'active' safety (detecting errors); (3) no single model dominates across tasks, and scale does not guarantee factual conformity. A public leaderboard and the CEF code are provided.

Problem Statement

Standard medical benchmarks (MCQs, USMLE-style) are saturated and fail to predict whether an LLM can perform precise clinical operations (calculations, SQL, error auditing). Teams need fast, offline leading indicators that catch operational failures and hallucinations before costly, risky pilots.

Main Contribution

Define MEDIC — a five-dimension evaluation framework (Medical reasoning, Ethics & bias, Data & language, In-context learning, Clinical safety) focused on functional clinical utility.

Introduce the Cross-Examination Framework (CEF) — a reference-free, question-based verifier that measures Coverage, Conformity, Consistency, and Conciseness.

Assemble a heterogeneous task suite (MedCalc, EHRSQL, MEDEC, DischargeMe, ACI-Bench, MedQA, MedMCQA, etc.) and evaluate many open models under uniform harness conditions.

Reveal three practical gaps: knowledge vs execution, passive vs active safety, and task-dependent model heterogeneity.

Provide a public leaderboard and open CEF code to reproduce and extend evaluations.

Key Findings

Static knowledge benchmarks are saturated, but operational tasks lag far behind.

Numbers: Knowledge median >75% vs operational median <40% (Fig. 4a)

Passive safety (refusal) is near-perfect while active error detection is poor.

Numbers: Med-Safety scores ≈1; MEDEC detection often drops to near-zero (Fig. 4b)

Reference-free factuality checks (CEF) reveal hallucinations not captured by lexical metrics.

Numbers: Negligible Spearman correlation between CEF and BLEU/ROUGE/BERTScore (Fig. 3c)
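
To make the correlation claim concrete, here is a minimal sketch of how a Spearman rank correlation between a CEF score and a lexical metric could be computed with the Python standard library. The score lists are made-up placeholders, not values from the paper, and the helper names `average_ranks` and `spearman` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Placeholder per-summary scores, for illustration only.
cef_conformity = [0.91, 0.62, 0.85, 0.40, 0.77]
rouge_l = [0.31, 0.35, 0.28, 0.33, 0.30]
rho = spearman(cef_conformity, rouge_l)
```

A value of `rho` near zero on real outputs would mean the lexical metric ranks summaries in an order unrelated to the CEF ranking, which is the pattern the finding describes.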

No single architecture dominates; larger models can produce more contradictions.

Numbers: Heterogeneous rank matrix; larger models cluster with lower conformity (Fig. 2, Fig. 3b)

Pairwise LLM-judge rankings for open-ended clinical QA are highly robust across judges.

Numbers: Inter-judge Spearman ρ ≥ 0.98 (Fig. 5b)
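
Pairwise judge decisions are commonly turned into a ranking with Elo ratings, which also appear in the metrics list of this summary. The sketch below shows the standard Elo update under stated assumptions: the K-factor of 32, the starting rating of 1000, and the model names are illustrative choices, and the paper's exact procedure is not described here.

```python
def elo_ratings(matches, k=32.0, initial=1000.0):
    """Compute Elo ratings from an ordered list of (winner, loser) pairs.

    k and initial are assumed defaults, not values taken from the paper.
    """
    ratings = {}
    for winner, loser in matches:
        ra = ratings.setdefault(winner, initial)
        rb = ratings.setdefault(loser, initial)
        # Probability that the winner was expected to win before the match.
        expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] = ra + delta
        ratings[loser] = rb - delta
    return ratings


# Hypothetical judge verdicts: each pair is (preferred model, other model).
verdicts = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_ratings(verdicts)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Sequential Elo depends on the order of matches, so in practice one might shuffle the verdicts several times and average the ratings before comparing the rankings produced by different judges.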

Results

Accuracy

Value: >75%

Baseline: state-of-the-art saturating MCQs

Accuracy

Value: <40%

Baseline: knowledge-task median

Accuracy

Value: 62.14%

Baseline: varies by model

CEF vs lexical metrics correlation

Value: ≈0 (negligible)

Baseline: expectation of correlation

Open-ended QA inter-judge rank correlation

Value: ρ ≥ 0.98

Baseline: robust ranking

What To Try In 7 Days

Run MEDIC's MedCalc and EHRSQL tasks on candidate models to catch arithmetic and SQL failures.

Apply CEF to your note-generation outputs to measure Coverage, Consistency, and Conformity without references.

Add an active-safety test (MEDEC or similar) to measure error-detection, not just refusal behavior.
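
As a starting point for the active-safety check, the following sketch scores error detection with F1 on a small labeled set. It assumes you have, for each clinical note, a ground-truth flag saying whether an error was planted and a flag saying whether the model reported one; the function name `detection_f1` and the sample data are hypothetical, and MEDEC's own scoring may differ.

```python
def detection_f1(has_error, flagged):
    """F1 of the model's error flags against ground-truth error labels."""
    if len(has_error) != len(flagged):
        raise ValueError("label and prediction lists must have equal length")
    pairs = list(zip(has_error, flagged))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative data: four notes, two of which contain a planted error.
has_error = [True, True, False, False]
flagged = [True, False, True, False]
f1 = detection_f1(has_error, flagged)
```

Reporting this number next to the refusal rate makes the passive-versus-active gap visible: a model can refuse every harmful prompt and still score near zero here if it never flags planted errors.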

Reproducibility

Data Urls

  • References to public datasets used (MedCalc, EHRSQL, ACI-Bench, DischargeMe, MEDEC repositories as cited)
  • Dataset links cited in Appendix A.3 and references

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLM-as-a-judge methods can inherit judge biases (length, self-preference) despite high inter-judge agreement.
  • Many safety datasets are physician-centric and may not cover patient or nursing concerns.
  • Automated metrics are leading indicators only and cannot substitute in-situ clinical validation.
  • Some models could not be evaluated on long-context tasks due to context-window or memory limits.

When Not To Use

  • Do not treat MEDIC scores as sufficient clinical validation for deployment without human trials.
  • Do not rely solely on CEF or LLM-judges to certify safety in high-risk clinical decisions.

Failure Modes

  • Automation bias: over-trusting leading indicators and skipping human validation.
  • Goodhart's Law: models optimized to game MEDIC tasks may hide unmeasured failure modes.
  • Judge bias: LLM evaluators can prefer certain styles or lengths.
  • Context limit failures: architectures with short context windows fail long-note tasks.

Core Entities

Models

  • GPT-OSS-120B
  • GPT-OSS-20B
  • Llama-4-Maverick
  • DeepSeek-V3.1
  • Med42-v2-8B
  • Mistral-Large-3-675B
  • Qwen2.5-72B
  • Kimi-K2-Thinking
  • Phi-4

Metrics

  • Accuracy
  • Execution success / exact match
  • RS(0) (Reliability Score)
  • Coverage / Conformity / Consistency / Conciseness (CEF)
  • Elo (pairwise)
  • Refusal / Harmfulness score
  • F1 (error detection)
  • ROUGE / BLEU / BERTScore

Datasets

  • MedQA
  • MedMCQA
  • MMLU-Pro
  • MedCalc
  • EHRSQL
  • DischargeMe
  • ACI-Bench
  • MEDEC
  • MedicationQA
  • HealthSearchQA
  • ExpertQA
  • GSM8K
  • AIME
  • IFEval
  • Med-Safety
  • ToxiGen
  • PubMedQA
  • HealthBench

Benchmarks

  • MEDIC framework
  • Cross-Examination Framework (CEF)
  • EHRSQL Reliability Score (RS(0))
  • MEDEC stages
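
The reliability score is not defined in this summary. The sketch below assumes the definition commonly described for the EHRSQL shared task: a correct query on an answerable question or an abstention on an unanswerable one earns 1, an abstention on an answerable question earns 0, and any other outcome costs a penalty N, with RS(0) meaning N = 0. Treat the function name, record layout, and returned fraction as illustrative assumptions.

```python
def reliability_score(records, penalty=0.0):
    """Mean per-question reliability score (assumed RS(N) definition).

    Each record is a tuple (answerable, abstained, correct):
      answerable: the question can be answered from the database
      abstained:  the model declined to produce SQL
      correct:    the produced SQL returned the right result
    """
    if not records:
        raise ValueError("records must not be empty")
    total = 0.0
    for answerable, abstained, correct in records:
        if answerable and not abstained and correct:
            total += 1.0      # right answer on an answerable question
        elif not answerable and abstained:
            total += 1.0      # correctly declined an unanswerable question
        elif answerable and abstained:
            total += 0.0      # missed opportunity, no penalty
        else:
            total -= penalty  # wrong SQL, or an attempt on an unanswerable one
    return total / len(records)


# Illustrative outcomes covering all five cases.
records = [
    (True, False, True),
    (True, True, False),
    (True, False, False),
    (False, True, False),
    (False, False, False),
]
rs0 = reliability_score(records, penalty=0.0)
```

With `penalty=0.0` wrong attempts simply score zero, so RS(0) rewards correct answers and correct abstentions without punishing mistakes; raising the penalty shows how much a model relies on guessing.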