Overview
MEDIC is ready for research and engineering use as a leading-indicator suite; it helps prioritize models but cannot replace real-world pilots or clinician sign-off.
Citations: 6
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
MEDIC gives practical, faster checks for clinical readiness: it flags operational failures and hallucinations that standard exams miss, reducing deployment risk before costly pilots.
Summary TLDR
MEDIC is a modular evaluation framework for clinical LLMs that emphasizes operational tasks, safety auditing, and reference-free factuality checks. It adds a Cross-Examination Framework (CEF) to verify summaries without human references, and it runs deterministic execution tests (e.g., SQL, clinical calculations). Key findings: (1) static medical knowledge (USMLE-style) is saturated and does not predict success on precise operational tasks (MedCalc, EHRSQL); (2) 'passive' safety (refusal) is not the same as 'active' safety (error detection); (3) no single model dominates across tasks, and scale does not guarantee factual conformity. A public leaderboard and the CEF code are provided.
Problem Statement
Standard medical benchmarks (MCQs, USMLE-style) are saturated and fail to predict whether an LLM can perform precise clinical operations (calculations, SQL, error auditing). Teams need fast, offline leading indicators that catch operational failures and hallucinations before costly, risky pilots.
Main Contribution
Define MEDIC, a five-dimension evaluation framework (Medical reasoning, Ethics & bias, Data & language, In-context learning, Clinical safety) focused on functional clinical utility.
Introduce the Cross-Examination Framework (CEF), a reference-free, question-based verifier that measures Coverage, Conformity, Consistency, and Conciseness.
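This summary does not give the CEF formulas, so the following is only a minimal sketch of how a question-based, reference-free verifier could turn answers into the four scores. Every definition here is an assumption made for illustration: the `QA` record, the `IDK` label, the `cef_scores` function, and the specific ratios are hypothetical and may differ from the authors' implementation. The sketch assumes that closed-ended questions have already been generated from the source document and from the summary, and that each question has been answered once from each text.

```python
from dataclasses import dataclass

IDK = "idk"  # hypothetical label: the question cannot be answered from the given text


@dataclass
class QA:
    """One closed-ended question, answered separately from each text."""
    from_source: str   # "yes", "no", or IDK when only the source is shown
    from_summary: str  # "yes", "no", or IDK when only the summary is shown


def fraction(flags):
    """Share of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def contradicts(q):
    """Both texts give a definite answer and the answers disagree."""
    return IDK not in (q.from_source, q.from_summary) and q.from_source != q.from_summary


def cef_scores(source_questions, summary_questions, source_text, summary_text):
    """Assumed scoring: all four definitions are illustrative, not the paper's."""
    return {
        # Coverage: source-derived questions the summary can still answer.
        "coverage": fraction(q.from_summary != IDK for q in source_questions),
        # Conformity: source-derived questions the summary does not contradict.
        "conformity": fraction(not contradicts(q) for q in source_questions),
        # Consistency: summary-derived questions the source can support.
        "consistency": fraction(q.from_source != IDK for q in summary_questions),
        # Conciseness: relative word-count reduction from source to summary.
        "conciseness": 1 - len(summary_text.split()) / max(len(source_text.split()), 1),
    }
```

Under these assumed definitions, a summary that introduces a claim the source cannot support lowers consistency, while a summary that reverses a fact lowers conformity, which is why the two are kept as separate scores.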
Key Findings
Static knowledge benchmarks are saturated, but operational tasks lag far behind.
Passive safety (refusal) is near-perfect while active error detection is poor.
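The passive/active distinction in the second finding can be made concrete with two separate rates. The sketch below is illustrative only: the function names and input shapes are hypothetical, and it is not MEDIC's scoring code. It assumes one list of refusal outcomes on harmful prompts, and one list of clinical notes labeled by whether they contain an error and whether the model flagged one.

```python
def rate(flags):
    """Share of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def passive_safety(refused):
    """Refusal rate: share of harmful prompts the model declined to answer."""
    return rate(refused)


def active_safety(cases):
    """Error-detection metrics over (has_error, flagged) pairs.

    detection_rate: share of erroneous notes the model flagged.
    false_alarm_rate: share of clean notes the model flagged anyway.
    """
    flagged_with_error = [flagged for has_error, flagged in cases if has_error]
    flagged_when_clean = [flagged for has_error, flagged in cases if not has_error]
    return {
        "detection_rate": rate(flagged_with_error),
        "false_alarm_rate": rate(flagged_when_clean),
    }
```

Reporting the two numbers side by side makes it visible when a model refuses reliably yet misses most errors placed in front of it.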
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | >75% | state-of-the-art saturating MCQs | — | MedQA, MedMCQA (knowledge tasks) | Top-tier models show median >75% on knowledge tasks | Section 3.3, Figure 4a |
| Accuracy | <40% | knowledge-task median | more than 35 percentage points lower | MedCalc, EHRSQL (operational tasks) | Operational tasks show median <40% while knowledge tasks are >75% | Section 3.3, Figure 4a |
What To Try In 7 Days
Run MEDIC's MedCalc and EHRSQL tasks on candidate models to catch arithmetic and SQL failures.
Apply CEF to your note-generation outputs to measure Coverage, Consistency, and Conformity without references.
Add an active-safety test (MEDEC or similar) to measure error-detection, not just refusal behavior.
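For the SQL step above, a deterministic execution test can be approximated by running the model's query and a reference query against the same database and comparing the results. The sketch below uses Python's standard `sqlite3` module. The `labs` table, its rows, and the helper names are hypothetical stand-ins, not the EHRSQL schema or MEDIC's harness.

```python
import sqlite3


def run_query(conn, sql):
    """Return rows in a canonical order, or None if the query fails to execute."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)


def execution_match(conn, predicted_sql, gold_sql):
    """True only if both queries run and return the same set of rows."""
    gold = run_query(conn, gold_sql)
    predicted = run_query(conn, predicted_sql)
    return gold is not None and predicted == gold


def build_toy_db():
    """Tiny in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO labs VALUES (?, ?, ?)",
        [(1, "creatinine", 1.1), (1, "creatinine", 1.4), (2, "creatinine", 0.9)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_toy_db()
    gold = "SELECT MAX(value) FROM labs WHERE patient_id = 1"
    candidate = "SELECT value FROM labs WHERE patient_id = 1 ORDER BY value DESC LIMIT 1"
    print(execution_match(conn, candidate, gold))
```

Comparing executed results rather than query text lets differently written but equivalent queries pass, while a query that fails to run counts as a miss instead of crashing the evaluation.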
Risks & Boundaries
Limitations
LLM-as-a-judge methods can inherit judge biases (length, self-preference) despite high inter-judge agreement.
Many safety datasets are physician-centric and may not cover patient or nursing concerns.
When Not To Use
Do not treat MEDIC scores as sufficient clinical validation for deployment without human trials.
Do not rely solely on CEF or LLM-judges to certify safety in high-risk clinical decisions.
Failure Modes
Automation bias: over-trusting leading indicators and skipping human validation.
Goodhart's Law: models optimized to game MEDIC tasks may hide unmeasured failure modes.

