MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Overview

Decision Snapshot: Ready For Pilot

MEDIC is ready for research and engineering use as a leading-indicator suite; it helps prioritize models but cannot replace real-world pilots or clinician sign-off.

Citations: 6

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Praveenkumar Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Prateek Munjal, Nada Saadi, Hamza A Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MEDIC gives practical, faster checks for clinical readiness: it flags operational failures and hallucinations that standard exams miss, reducing deployment risk before costly pilots.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

MEDIC is a modular evaluation framework for clinical LLMs that stresses operational tasks, safety auditing, and reference-free factuality checks. It adds a Cross-Examination Framework (CEF) to verify summaries without human references, and runs deterministic execution tests (e.g., SQL, clinical calculations). Key findings: (1) static medical knowledge (USMLE-style) is saturated and does not predict success on precise operational tasks (MedCalc, EHRSQL); (2) "passive" safety (refusal) is not the same as "active" safety (error detection); (3) no single model dominates across tasks, and scale does not guarantee factual conformity. A public leaderboard and the CEF code are provided.

Problem Statement

Standard medical benchmarks (MCQs, USMLE-style) are saturated and fail to predict whether an LLM can perform precise clinical operations (calculations, SQL, error auditing). Teams need fast, offline leading indicators that catch operational failures and hallucinations before costly, risky pilots.

Main Contribution

Define MEDIC — a five-dimension evaluation framework (Medical reasoning, Ethics & bias, Data & language, In-context learning, Clinical safety) focused on functional clinical utility.

Introduce the Cross-Examination Framework (CEF) — a reference-free, question-based verifier that measures Coverage, Conformity, Consistency, and Conciseness.
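To make the idea of a question-based, reference-free verifier more concrete, here is a minimal sketch of how answered questions could be turned into the four scores. It assumes yes/no questions are written from the source document and from the summary, and each is answered against the other text. The metric definitions used here, the `QA` record, and the `cef_scores` helper are assumptions for illustration only and are not the authors' implementation; the CEF repository is the authority on the actual method.

```python
from dataclasses import dataclass


@dataclass
class QA:
    origin: str  # "source" or "summary": the text the question was written from
    answer: str  # "yes", "no", or "idk", as answered against the OTHER text


def cef_scores(qas, source_text, summary_text):
    """Hypothetical scoring: each value is a fraction (conciseness can go negative)."""
    from_source = [q for q in qas if q.origin == "source"]
    from_summary = [q for q in qas if q.origin == "summary"]

    def frac(items, keep):
        return sum(1 for q in items if keep(q)) / len(items) if items else 0.0

    source_words = max(len(source_text.split()), 1)
    return {
        # Assumed: source questions the summary can answer at all.
        "coverage": frac(from_source, lambda q: q.answer != "idk"),
        # Assumed: source questions the summary does not contradict.
        "conformity": frac(from_source, lambda q: q.answer != "no"),
        # Assumed: summary questions the source can answer (no unsupported claims).
        "consistency": frac(from_summary, lambda q: q.answer != "idk"),
        # Assumed: relative reduction in word count.
        "conciseness": 1 - len(summary_text.split()) / source_words,
    }


example_qas = [
    QA("source", "yes"), QA("source", "idk"), QA("source", "no"), QA("source", "yes"),
    QA("summary", "yes"), QA("summary", "idk"),
]
example_scores = cef_scores(example_qas, "a b c d e f g h i j", "a b c d")
```

In a real pipeline the `answer` values would come from an LLM judge, so the judge-bias limitations noted elsewhere in this summary apply to these scores as well.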

Key Findings

Static knowledge benchmarks are saturated, but operational tasks lag far behind.

Numbers: Knowledge median >75% vs operational median <40% (Fig. 4a)

Practical Use: Do not use USMLE-style accuracy alone to approve models for clinical pipelines; separately validate calculations and database queries.

Evidence Ref: Section 3.3, Figure 4a
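A deterministic calculation check of the kind described above can be sketched as follows. The example uses the Cockcroft-Gault creatinine clearance equation as a stand-in clinical calculator and accepts a model's numeric answer if it falls within a relative tolerance of the reference. The 5% tolerance, the function names, and the choice of calculator are assumptions for illustration, not MedCalc's actual grading rules.

```python
def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female=False):
    """Estimated creatinine clearance in mL/min (Cockcroft-Gault equation)."""
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl


def within_tolerance(model_value, reference, rel_tol=0.05):
    """True when the model's answer is within rel_tol of the reference value.

    rel_tol=0.05 is an assumed threshold, not one taken from the paper.
    """
    return abs(model_value - reference) <= rel_tol * abs(reference)


# Reference value computed by code, never by the model under test.
reference = cockcroft_gault(age_years=60, weight_kg=72, serum_creatinine_mg_dl=1.0)

close_answer_ok = within_tolerance(78.5, reference)  # small rounding-style deviation
wrong_answer_ok = within_tolerance(92.0, reference)  # arithmetic slip
```

Because the reference is computed rather than judged, a check like this is cheap to run across many candidate models and leaves no room for judge bias.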

Passive safety (refusal) is near-perfect while active error detection is poor.

Numbers: Med-Safety scores ≈1; MEDEC detection often drops to near-zero (Fig. 4b)

Practical Use: Evaluate both refusal and auditing ability; a model that refuses harm may still miss factual errors in notes.

Evidence Ref: Section 3.4, Figure 4b, Table 2
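To see why active safety needs its own metric, consider a small sketch that scores error detection with F1 over a set of notes, each labeled as containing an error or not. The `detection_f1` helper and the toy labels are hypothetical, and MEDEC's own scoring may differ in detail.

```python
def detection_f1(gold, predicted):
    """Precision, recall, and F1 for binary 'this note contains an error' flags."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: three of five notes contain a planted error.
gold_flags = [True, True, True, False, False]

# A cautious model that flags only one note, and one that never flags anything.
cautious = detection_f1(gold_flags, [True, False, False, False, False])
silent = detection_f1(gold_flags, [False, False, False, False, False])
```

The silent model never produces harmful content and would look fine on a refusal test, yet its detection F1 is zero, which is the gap between passive and active safety that this finding describes.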

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	>75%	state-of-the-art saturating MCQs	—	MedQA, MedMCQA (knowledge tasks)	Top-tier models show median >75% on knowledge tasks	Section 3.3, Figure 4a
Accuracy	<40%	knowledge-task median	more than 35 percentage points lower	MedCalc, EHRSQL (operational tasks)	Operational tasks show median <40% while knowledge tasks are >75%	Section 3.3, Figure 4a

What To Try In 7 Days

Run MEDIC's MedCalc and EHRSQL tasks on candidate models to catch arithmetic and SQL failures.

Apply CEF to your note-generation outputs to measure Coverage, Consistency, and Conformity without references.

Add an active-safety test (MEDEC or similar) to measure error-detection, not just refusal behavior.
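For the SQL step, one common way to test generated queries deterministically is execution match: run the predicted query and a gold query against the same database and compare the returned rows. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `labs` table, the queries, and the `execution_match` helper are hypothetical and do not reflect the EHRSQL schema or its official scorer.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical toy schema standing in for an EHR database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL)")
conn.executemany(
    "INSERT INTO labs VALUES (?, ?, ?)",
    [(1, "creatinine", 1.0), (1, "sodium", 139.0), (2, "creatinine", 2.3)],
)

gold = "SELECT patient_id FROM labs WHERE test = 'creatinine' AND value > 1.5"
equivalent = "SELECT patient_id FROM labs WHERE value > 1.5 AND test = 'creatinine'"
missing_filter = "SELECT patient_id FROM labs WHERE value > 1.5"
invalid = "SELECT patient FROM lab"

results = {
    "equivalent": execution_match(conn, equivalent, gold),
    "missing_filter": execution_match(conn, missing_filter, gold),
    "invalid": execution_match(conn, invalid, gold),
}
```

Comparing results rather than query text means a differently worded but equivalent query still passes, while a query that silently drops a filter fails even though it runs without error.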

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/m42-health/cross-examination-framework

https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

Data URLs

Public datasets used: MedCalc, EHRSQL, ACI-Bench, DischargeMe, MEDEC (repositories as cited)

Dataset links are cited in Appendix A.3 and the references

Risks & Boundaries

Limitations

LLM-as-a-judge methods can inherit judge biases (length, self-preference) despite high inter-judge agreement.

Many safety datasets are physician-centric and may not cover patient or nursing concerns.

When Not To Use

Do not treat MEDIC scores as sufficient clinical validation for deployment without human trials.

Do not rely solely on CEF or LLM-judges to certify safety in high-risk clinical decisions.

Failure Modes

Automation bias: over-trusting leading indicators and skipping human validation.

Goodhart's Law: models optimized to game MEDIC tasks may hide unmeasured failure modes.

Core Entities

Models

GPT-OSS-120B, GPT-OSS-20B, Llama-4-Maverick, DeepSeek-V3.1, Med42-v2-8B, Mistral-Large-3-675B, Qwen2.5-72B, Kimi-K2-Thinking, Phi-4

Metrics

Accuracy, Execution success / exact match, RS(0) (Reliability Score), Coverage / Conformity / Consistency / Conciseness (CEF), Elo (pairwise), Refusal / Harmfulness score, F1 (error detection), ROUGE / BLEU / BERTScore

Datasets

MedQA, MedMCQA, MMLU-Pro, MedCalc, EHRSQL, DischargeMe, ACI-Bench, MEDEC, MedicationQA, HealthSearchQA, ExpertQA, GSM8K, AIME, IFEval, Med-Safety, ToxiGen, PubMedQA, HealthBench

Benchmarks

MEDIC framework, Cross-Examination Framework (CEF), EHRSQL Reliability Score (RS(0)), MEDEC stages
