MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

September 11, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

MEDIC is ready for research and engineering use as a leading-indicator suite; it helps prioritize models but cannot replace real-world pilots or clinician sign-off.

Citations: 6

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Praveenkumar Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Prateek Munjal, Nada Saadi, Hamza A Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan

Why It Matters For Business

MEDIC gives practical, faster checks for clinical readiness: it flags operational failures and hallucinations that standard exams miss, reducing deployment risk before costly pilots.

Summary TLDR

MEDIC is a modular evaluation framework for clinical LLMs that stresses operational tasks, safety auditing, and reference-free factuality checks. It adds a Cross-Examination Framework (CEF) to verify summaries without human references, and it runs deterministic execution tests (e.g., SQL, clinical calculations). Key findings: (1) static medical knowledge (USMLE-style) is saturated and does not predict success on precise operational tasks (MedCalc, EHRSQL); (2) 'passive' safety (refusal) is not the same as 'active' safety (error detection); (3) no single model dominates across tasks, and scale does not guarantee factual conformity. A public leaderboard and the CEF code are provided.
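To make "deterministic execution test" concrete for the clinical-calculation case, here is a minimal sketch of a tolerance-based answer check. It is not MEDIC's or MedCalc's actual scorer: the helper names (`extract_last_number`, `calc_is_correct`), the 5% relative tolerance, and the choice of BMI as the example formula are all assumptions made for illustration.

```python
import math
import re

def extract_last_number(text):
    """Return the last numeric value in a model response, or None.
    Naive on purpose: a trailing unit such as 'kg/m2' would be misread."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / (height_m ** 2)

def calc_is_correct(response, expected, rel_tol=0.05):
    """Pass only if the extracted answer is within rel_tol of the reference.
    The 5% tolerance is an assumed value, not one taken from the paper."""
    value = extract_last_number(response)
    return value is not None and math.isclose(value, expected, rel_tol=rel_tol)

expected = bmi(70.0, 1.75)  # reference value computed in code, not by a model
print(calc_is_correct("The BMI is approximately 22.9", expected))
print(calc_is_correct("The BMI is 24.5", expected))
```

Because the reference value is computed rather than judged, the check gives the same verdict on every run, which is what distinguishes it from LLM-as-a-judge scoring.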

Problem Statement

Standard medical benchmarks (MCQs, USMLE-style) are saturated and fail to predict whether an LLM can perform precise clinical operations (calculations, SQL, error auditing). Teams need fast, offline leading indicators that catch operational failures and hallucinations before costly, risky pilots.

Main Contribution

Define MEDIC — a five-dimension evaluation framework (Medical reasoning, Ethics & bias, Data & language, In-context learning, Clinical safety) focused on functional clinical utility.

Introduce the Cross-Examination Framework (CEF) — a reference-free, question-based verifier that measures Coverage, Conformity, Consistency, and Conciseness.
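The four CEF scores can be pictured as simple aggregates over closed-ended question answers. The sketch below is one plausible reading, not the published CEF implementation: it assumes questions are generated from both the source document and the summary, that each is answered "YES", "NO", or "IDK" against both texts, and that the scores are percentages. The function name `cef_scores` and the exact formulas are assumptions.

```python
def pct(count, total):
    """Percentage helper that tolerates an empty question set."""
    return 100.0 * count / total if total else 0.0

def cef_scores(doc_questions, summary_questions, doc_words, summary_words):
    """Each question is a pair (answer_from_document, answer_from_summary),
    with answers drawn from {"YES", "NO", "IDK"}. Assumed scoring rules."""
    # Coverage: document-derived questions that the summary can still answer.
    missed = sum(s == "IDK" for _, s in doc_questions)
    coverage = 100.0 - pct(missed, len(doc_questions))

    # Consistency: summary-derived questions that the document can answer;
    # an IDK from the document suggests unsupported (hallucinated) content.
    unsupported = sum(d == "IDK" for d, _ in summary_questions)
    consistency = 100.0 - pct(unsupported, len(summary_questions))

    # Conformity: questions where the two texts give contradictory answers.
    all_questions = doc_questions + summary_questions
    contradictions = sum({d, s} == {"YES", "NO"} for d, s in all_questions)
    conformity = 100.0 - pct(contradictions, len(all_questions))

    # Conciseness: relative reduction in word count.
    conciseness = 100.0 * (1 - summary_words / doc_words)

    return {"coverage": coverage, "conformity": conformity,
            "consistency": consistency, "conciseness": conciseness}

# Toy answers; in practice an LLM would generate and answer the questions.
doc_qs = [("YES", "YES"), ("YES", "IDK"), ("YES", "YES"), ("YES", "NO")]
summ_qs = [("YES", "YES"), ("IDK", "YES")]
scores = cef_scores(doc_qs, summ_qs, doc_words=400, summary_words=100)
print(scores)
```

No human-written reference summary appears anywhere in the computation, which is the sense in which the verifier is reference-free.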

Key Findings

Static knowledge benchmarks are saturated, but operational tasks lag far behind.

Numbers: Knowledge median >75% vs operational median <40% (Fig. 4a)

Practical Use: Do not use USMLE-style accuracy alone to approve models for clinical pipelines; separately validate calculations and database queries.

Evidence Ref: Section 3.3, Figure 4a

Passive safety (refusal) is near-perfect while active error detection is poor.

Numbers: Med-Safety scores ≈1; MEDEC detection often drops to near-zero (Fig. 4b)

Practical Use: Evaluate both refusal and auditing ability; a model that refuses harm may still miss factual errors in notes.

Evidence Ref: Section 3.4, Figure 4b, Table 2
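The gap between passive and active safety is easier to see when the two are scored side by side. The sketch below uses a plain refusal rate as a stand-in for a passive-safety score and a standard F1 over per-note error flags for active safety; the function names and the toy labels are hypothetical, and neither function reproduces the Med-Safety or MEDEC scoring protocols.

```python
def refusal_rate(refused):
    """Passive safety: share of harmful prompts the model refused."""
    return sum(refused) / len(refused)

def detection_f1(has_error, flagged):
    """Active safety: F1 of the model's error flags against ground truth.
    has_error[i] is True if note i contains a planted error;
    flagged[i] is True if the model reported an error in note i."""
    tp = sum(h and f for h, f in zip(has_error, flagged))
    fp = sum((not h) and f for h, f in zip(has_error, flagged))
    fn = sum(h and (not f) for h, f in zip(has_error, flagged))
    if tp == 0:
        return 0.0  # no correct detections, so F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: the model refuses every harmful prompt but audits notes poorly.
refused = [True, True, True, True, True]
has_error = [True, True, True, False, False]
flagged = [True, False, False, False, True]

print(refusal_rate(refused))
print(detection_f1(has_error, flagged))
```

Reporting both numbers together keeps a perfect refusal score from masking weak auditing.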

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | >75% | state-of-the-art saturating MCQs | n/a | MedQA, MedMCQA (knowledge tasks) | Top-tier models show median >75% on knowledge tasks | Section 3.3, Figure 4a
Accuracy | <40% | knowledge-task median | ≈ -35 percentage points | MedCalc, EHRSQL (operational tasks) | Operational tasks show median <40% while knowledge tasks are >75% | Section 3.3, Figure 4a

What To Try In 7 Days

Run MEDIC's MedCalc and EHRSQL tasks on candidate models to catch arithmetic and SQL failures.

Apply CEF to your note-generation outputs to measure Coverage, Consistency, and Conformity without references.

Add an active-safety test (MEDEC or similar) to measure error-detection, not just refusal behavior.
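For the SQL item in the list above, an execution-based check compares what a generated query returns with what a reference query returns, rather than comparing query strings. The following is a minimal sketch using Python's built-in `sqlite3`; the `labs` table, its rows, and the helper names are hypothetical, and the sketch does not implement EHRSQL's Reliability Score or its handling of unanswerable questions.

```python
import sqlite3

def run_query(conn, sql):
    """Execute SQL and return rows in a canonical order, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)  # order-insensitive comparison

def execution_match(conn, predicted_sql, gold_sql):
    """True only if the predicted query runs and returns the gold result set."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return pred is not None and pred == gold

# Hypothetical toy schema standing in for an EHR database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL);
    INSERT INTO labs VALUES (1, 'creatinine', 1.1),
                            (2, 'creatinine', 2.4),
                            (3, 'sodium', 139.0);
""")

gold = "SELECT patient_id FROM labs WHERE test = 'creatinine' AND value > 2.0"
reordered = "SELECT patient_id FROM labs WHERE value > 2.0 AND test = 'creatinine'"
missing_filter = "SELECT patient_id FROM labs WHERE value > 2.0"
invalid = "SELECT patient FROM lab"

for candidate in (reordered, missing_filter, invalid):
    print(execution_match(conn, candidate, gold))
```

Comparing result sets accepts queries that are written differently but equivalent, while still catching a dropped filter or a query that fails to run.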

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

References to public datasets used (MedCalc, EHRSQL, ACI-Bench, DischargeMe, MEDEC repositories as cited). Dataset links are cited in Appendix A.3 and the references.

Risks & Boundaries

Limitations

LLM-as-a-judge methods can inherit judge biases (length, self-preference) despite high inter-judge agreement.

Many safety datasets are physician-centric and may not cover patient or nursing concerns.

When Not To Use

Do not treat MEDIC scores as sufficient clinical validation for deployment without human trials.

Do not rely solely on CEF or LLM-judges to certify safety in high-risk clinical decisions.

Failure Modes

Automation bias: over-trusting leading indicators and skipping human validation.

Goodhart's Law: models optimized to game MEDIC tasks may hide unmeasured failure modes.

Core Entities

Models

GPT-OSS-120B, GPT-OSS-20B, Llama-4-Maverick, DeepSeek-V3.1, Med42-v2-8B, Mistral-Large-3-675B, Qwen2.5-72B, Kimi-K2-Thinking, Phi-4

Metrics

Accuracy, Execution success / exact match, RS(0) (Reliability Score), Coverage / Conformity / Consistency / Conciseness (CEF), Elo (pairwise), Refusal / Harmfulness score, F1 (error detection), ROUGE / BLEU / BERTScore

Datasets

MedQA, MedMCQA, MMLU-Pro, MedCalc, EHRSQL, DischargeMe, ACI-Bench, MEDEC, MedicationQA, HealthSearchQA, ExpertQA, GSM8K, AIME, IFEval, Med-Safety, ToxiGen, PubMedQA, HealthBench

Benchmarks

MEDIC framework, Cross-Examination Framework (CEF), EHRSQL Reliability Score (RS(0)), MEDEC stages