Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
6
Why It Matters For Business
MEDIC gives teams fast, practical checks of clinical readiness. It flags operational failures and hallucinations that standard exams miss, which reduces deployment risk before costly pilots.
Summary TLDR
MEDIC is a modular evaluation framework for clinical LLMs that stresses operational tasks, safety auditing, and reference-free factuality checks. It adds a Cross-Examination Framework (CEF) to verify summaries without human references and runs deterministic execution tests (e.g., SQL, clinical calculations). Key findings: (1) scores on static medical-knowledge benchmarks (USMLE-style) are saturated and do not predict success on precise operational tasks (MedCalc, EHRSQL); (2) 'passive' safety (refusal) is not the same as 'active' safety (error detection); (3) no single model dominates across tasks, and scale does not guarantee factual conformity. A public leaderboard and the CEF code are provided.
Problem Statement
Standard medical benchmarks (MCQs, USMLE-style) are saturated and fail to predict whether an LLM can perform precise clinical operations (calculations, SQL, error auditing). Teams need fast, offline leading indicators that catch operational failures and hallucinations before costly, risky pilots.
Main Contribution
Define MEDIC, a five-dimension evaluation framework (Medical reasoning, Ethics & bias, Data & language, In-context learning, Clinical safety) focused on functional clinical utility.
Introduce the Cross-Examination Framework (CEF), a reference-free, question-based verifier that measures Coverage, Conformity, Consistency, and Conciseness.
Assemble a heterogeneous task suite (MedCalc, EHRSQL, MEDEC, DischargeMe, ACI-Bench, MedQA, MedMCQA, etc.) and evaluate many open models under uniform harness conditions.
Reveal three practical gaps: knowledge vs execution, passive vs active safety, and task-dependent model heterogeneity.
Provide a public leaderboard and open CEF code to reproduce and extend evaluations.
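The four CEF dimensions named above can be illustrated with a small scoring sketch. The summary here does not give the exact formulas, so the definitions below are assumptions: questions are closed-ended (YES / NO / IDK), some are written from the source document and some from the summary, and each is answered once from the source alone and once from the summary alone. The names `QAPair` and `cef_scores` are hypothetical, and question generation and answering (normally done by an LLM) are left out.

```python
from dataclasses import dataclass

YES, NO, IDK = "YES", "NO", "IDK"


@dataclass
class QAPair:
    """Answers to one closed-ended question, given only the source or only the summary."""
    answer_from_source: str
    answer_from_summary: str


def cef_scores(source_questions, summary_questions, source_text, summary_text):
    """Compute assumed CEF-style scores from pre-answered questions.

    source_questions: QAPair list for questions written from the source document.
    summary_questions: QAPair list for questions written from the summary.
    """
    if not source_questions or not summary_questions:
        raise ValueError("need at least one question from each side")
    if not source_text.split():
        raise ValueError("source text is empty")

    n_src = len(source_questions)
    n_sum = len(summary_questions)

    # Coverage (assumed): share of source-derived questions the summary can answer.
    coverage = sum(q.answer_from_summary != IDK for q in source_questions) / n_src

    # Conformity (assumed): share of source-derived questions where the summary
    # does not flatly contradict the source (YES vs NO).
    contradictions = sum(
        {q.answer_from_source, q.answer_from_summary} == {YES, NO}
        for q in source_questions
    )
    conformity = 1 - contradictions / n_src

    # Consistency (assumed): share of summary-derived questions the source can
    # answer; an IDK from the source hints at unsupported (hallucinated) content.
    consistency = sum(q.answer_from_source != IDK for q in summary_questions) / n_sum

    # Conciseness (assumed): relative reduction in word count.
    conciseness = 1 - len(summary_text.split()) / len(source_text.split())

    return {
        "coverage": coverage,
        "conformity": conformity,
        "consistency": consistency,
        "conciseness": conciseness,
    }
```

Because the scorer only consumes answers, the same function can be reused with any question generator or answering model, which keeps the reference-free part of the pipeline separate from the arithmetic.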
Key Findings
Static knowledge benchmarks are saturated, but operational tasks lag far behind.
Passive safety (refusal) is near-perfect while active error detection is poor.
Reference-free factuality checks (CEF) reveal hallucinations not captured by lexical metrics.
No single architecture dominates; larger models can produce more contradictions.
Pairwise LLM-judge rankings for open-ended clinical QA are highly robust across judges.
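The last finding, that pairwise rankings are robust across judges, can be checked with a sketch like the following. It is an illustration under assumptions, not the paper's procedure: each judge's pairwise verdicts are turned into Elo ratings with a standard sequential update, and agreement between two judges is measured with Spearman's rank correlation (the summary does not say which correlation is used). The function names and the `k` and `base` defaults are hypothetical.

```python
def elo_ratings(matches, k=32.0, base=1000.0):
    """Sequential Elo update over (winner, loser) pairs from one judge.

    Note: sequential Elo depends on match order; shuffling and averaging
    over several orders is a common way to reduce that effect.
    """
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Expected score of the winner before the match.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def spearman(ratings_a, ratings_b):
    """Spearman rank correlation between two judges' ratings.

    Assumes both dicts cover the same models, at least two models,
    and no tied ratings (the closed-form formula below requires that).
    """
    models = sorted(ratings_a)
    if set(models) != set(ratings_b) or len(models) < 2:
        raise ValueError("both judges must rate the same set of >= 2 models")

    def ranks(ratings):
        order = sorted(models, key=lambda m: ratings[m], reverse=True)
        return {m: i for i, m in enumerate(order)}

    ra, rb = ranks(ratings_a), ranks(ratings_b)
    n = len(models)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in models)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

A correlation near 1 across judge pairs would support the robustness claim; it says nothing about biases that all judges share, such as a preference for longer answers.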
Results
Reported metrics (values not preserved here): Accuracy (three entries), CEF vs lexical metrics correlation, open-ended QA inter-judge rank correlation.
Who Should Care
What To Try In 7 Days
Run MEDIC's MedCalc and EHRSQL tasks on candidate models to catch arithmetic and SQL failures.
Apply CEF to your note-generation outputs to measure Coverage, Consistency, and Conformity without references.
Add an active-safety test (MEDEC or similar) to measure error detection, not just refusal behavior.
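For the first step above, a deterministic execution check can be as small as the sketch below. It is not the MEDIC harness: the table schema, the helper names (`build_demo_db`, `execution_match`, `calculation_match`), and the 5% tolerance are all assumptions made for illustration. The idea is to execute the model's SQL and the reference SQL against the same database and compare result sets, and to compare a computed clinical value against the reference within a tolerance.

```python
import math
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO labs VALUES (?, ?, ?)",
        [(1, "creatinine", 1.1), (1, "sodium", 140.0), (2, "creatinine", 2.3)],
    )
    return conn


def execution_match(conn, predicted_sql, gold_sql):
    """True if the predicted query runs and returns the same rows as the gold query.

    Row order is ignored. In practice, run model-generated SQL only against a
    disposable or read-only copy of the data.
    """
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def calculation_match(predicted, gold, rel_tol=0.05):
    """True if a numeric answer is within a relative tolerance of the reference."""
    return math.isclose(predicted, gold, rel_tol=rel_tol)


if __name__ == "__main__":
    conn = build_demo_db()
    gold = "SELECT patient_id FROM labs WHERE test = 'creatinine' AND value > 2.0"
    print(execution_match(conn, "SELECT patient_id FROM labs WHERE value > 2.0 AND test = 'creatinine'", gold))
    print(execution_match(conn, "SELECT patient_id FROM lab WHERE value > 2.0", gold))
    print(calculation_match(24.2, 24.22))
```

Comparing executed results rather than query text avoids penalizing a query that is written differently but returns the same rows, which is why execution-based checks are stricter and fairer than string matching.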
Reproducibility
Code URLs
Data URLs
- References to public datasets used (MedCalc, EHRSQL, ACI-Bench, DischargeMe, MEDEC repositories as cited)
- Dataset links cited in Appendix A.3 and references
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLM-as-a-judge methods can inherit judge biases (length, self-preference) despite high inter-judge agreement.
- Many safety datasets are physician-centric and may not cover patient or nursing concerns.
- Automated metrics are leading indicators only and cannot substitute for in-situ clinical validation.
- Some models could not be evaluated on long-context tasks due to context-window or memory limits.
When Not To Use
- Do not treat MEDIC scores as sufficient clinical validation for deployment without human trials.
- Do not rely solely on CEF or LLM-judges to certify safety in high-risk clinical decisions.
Failure Modes
- Automation bias: over-trusting leading indicators and skipping human validation.
- Goodhart's Law: models optimized to game MEDIC tasks may hide unmeasured failure modes.
- Judge bias: LLM evaluators can prefer certain styles or lengths.
- Context limit failures: architectures with short context windows fail long-note tasks.
Core Entities
Models
- GPT-OSS-120B
- GPT-OSS-20B
- Llama-4-Maverick
- DeepSeek-V3.1
- Med42-v2-8B
- Mistral-Large-3-675B
- Qwen2.5-72B
- Kimi-K2-Thinking
- Phi-4
Metrics
- Accuracy
- Execution success / exact match
- RS(0) (Reliability Score)
- Coverage / Conformity / Consistency / Conciseness (CEF)
- Elo (pairwise)
- Refusal / Harmfulness score
- F1 (error detection)
- ROUGE / BLEU / BERTScore
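RS(0) in the list above is not defined in this summary. The sketch below follows the reliability score as it is commonly described for EHRSQL-style text-to-SQL with unanswerable questions, and that reading is an assumption: a correct answer to an answerable question or an abstention on an unanswerable one scores 1, an abstention on an answerable question scores 0, and any wrong or unwarranted answer scores minus a penalty c. RS(0) is then the case c = 0. The function name and record layout are hypothetical.

```python
def reliability_score(records, penalty=0.0):
    """Mean per-question reliability score, RS(penalty).

    records: iterable of (answerable, abstained, correct) booleans.
    `correct` is ignored when the model abstained or the question is unanswerable.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to score")

    total = 0.0
    for answerable, abstained, correct in records:
        if answerable:
            if abstained:
                score = 0.0          # safe but unhelpful
            elif correct:
                score = 1.0          # helpful and right
            else:
                score = -penalty     # confidently wrong
        else:
            # Unanswerable question: abstaining is the only right move.
            score = 1.0 if abstained else -penalty
        total += score
    return total / len(records)
```

With a penalty of 0, wrong answers cost nothing beyond the missed point; raising the penalty models settings where a wrong query is worse than no query.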
Datasets
- MedQA
- MedMCQA
- MMLU-Pro
- MedCalc
- EHRSQL
- DischargeMe
- ACI-Bench
- MEDEC
- MedicationQA
- HealthSearchQA
- ExpertQA
- GSM8K
- AIME
- IFEval
- Med-Safety
- ToxiGen
- PubMedQA
- HealthBench
Benchmarks
- MEDIC framework
- Cross-Examination Framework (CEF)
- EHRSQL Reliability Score (RS(0))
- MEDEC stages

