A weekly-updated, contamination-free medical benchmark plus automated rubrics that align better with physicians than LLM-as-a-judge

Overview

Decision Snapshot: Needs Validation

Benchmark and automated rubrics are well validated with physician agreement and statistical correlation; however, live updating and grader dependencies on specific API versions reduce immediate clinical production readiness.

Citations: 0

Evidence Strength: 0.82

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 35%

Novelty: 70%

Authors

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Static medical benchmarks can give falsely high performance because models may have seen test cases during training. LiveMedBench detects contamination and tests real-time generalization. If you deploy LLMs for clinical tasks, evaluate on time-separated, evidence-checked benchmarks and prefer retrieval-enabled systems.

Who Should Care

CTO, ML Engineer, Data Scientist, Product Manager, Engineering Lead

Summary TLDR

LiveMedBench is a continually refreshed medical benchmark that harvests real clinical Q&A threads, filters and verifies them with a multi-agent pipeline, and generates case-specific binary rubrics to score open-ended LLM answers. On 2,756 bilingual cases (16,702 criteria) the best model scores 39.2%. The automated rubric grader correlates with physicians much better than an LLM-as-a-judge and reveals that most failures are about applying facts to patient context, not raw knowledge gaps.
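To make the idea of case-specific binary rubrics concrete, here is a minimal sketch of how a normalized rubric score could be computed. The `Criterion` schema, the use of per-criterion weights, and the clipping to [0, 1] are assumptions made for illustration; the summary only states that criteria are binary and that the score is a normalized rubric sum.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One binary rubric criterion for a single case (hypothetical schema)."""
    description: str
    weight: float  # assumption: positive for desired content, negative for harmful content
    met: bool      # the grader's yes/no decision for this criterion


def case_score(criteria: list[Criterion]) -> float:
    """Normalized rubric score in [0, 1] for one case.

    Earned points are the weights of all met criteria; the denominator is
    the sum of positive weights, so penalties can lower the score but not
    raise the ceiling.
    """
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


def benchmark_score(cases: list[list[Criterion]]) -> float:
    """Mean of per-case scores across all cases."""
    if not cases:
        return 0.0
    return sum(case_score(c) for c in cases) / len(cases)
```

Scoring each case separately and then averaging keeps a case with many criteria from dominating the benchmark score.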

Problem Statement

Static medical benchmarks leak into model training data and age quickly, which inflates scores and misses real-world change. Common automatic metrics (lexical overlap) and LLM-as-a-judge approaches also verify clinical correctness poorly for open-ended medical advice.

Main Contribution

LiveMedBench: a weekly-updated, bilingual (English/Chinese) medical benchmark built from verified online clinical threads; current snapshot: 2,756 cases and 16,702 binary rubric criteria.

Multi-Agent Clinical Curation Framework: three-agent pipeline (Screener, Validator, Controller) that structures threads into narrative/query/advice, validates against medical evidence, and vetoes hallucinated details.
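The three-agent curation flow can be sketched as a chain of gates, each of which can reject a thread. This is an illustrative outline only: the function signatures, the mapping of each duty to a specific agent, and the toy rule-based stand-ins are assumptions, and in a real pipeline each step would wrap an LLM call and evidence retrieval.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StructuredCase:
    """A thread split into the three parts named in the summary."""
    narrative: str
    query: str
    advice: str


# Hypothetical agent signatures (assumptions, not the paper's interfaces).
Screener = Callable[[str], Optional[StructuredCase]]   # raw thread -> structured case
Validator = Callable[[StructuredCase], bool]           # advice supported by evidence?
Controller = Callable[[str, StructuredCase], bool]     # no invented details?


def curate(thread: str, screen: Screener, validate: Validator,
           audit: Controller) -> Optional[StructuredCase]:
    """Return a curated case, or None if any agent rejects the thread."""
    case = screen(thread)
    if case is None:
        return None  # thread could not be structured
    if not validate(case):
        return None  # advice failed the evidence check
    if not audit(thread, case):
        return None  # veto: structured case contains details not in the source
    return case


# Toy stand-ins so the sketch runs without any model or retriever.
def toy_screen(thread: str) -> Optional[StructuredCase]:
    parts = [p.strip() for p in thread.split("|")]
    return StructuredCase(*parts) if len(parts) == 3 else None


def toy_validate(case: StructuredCase) -> bool:
    return bool(case.advice)  # placeholder for a real evidence lookup


def toy_audit(thread: str, case: StructuredCase) -> bool:
    # Every structured field must appear verbatim in the original thread.
    return all(part in thread for part in (case.narrative, case.query, case.advice))
```

Keeping the veto as a separate final step means a structuring error by an earlier agent cannot silently enter the benchmark.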

Key Findings

LiveMedBench snapshot size and scope

Numbers: 2,756 cases; 16,702 rubric criteria; 38 specialties

Practical Use: Use this dataset when you need many real-world, case-level tests across specialties and languages rather than small static exam-style sets.

Evidence Ref: §3.4, Fig. 3

Top model performance is low

Numbers: GPT-5.2 = 39.2% mean score on LiveMedBench

Practical Use: Expect current state-of-the-art LLMs to pass well under half of the rubric checks on real clinical cases; don't trust high scores from older static benchmarks.

Evidence Ref: §4.2, Fig. 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	2,756 clinical cases	—	—	LiveMedBench	§3.4, Abstract	—
Total rubric criteria	16,702 binary criteria	—	—	LiveMedBench	§3.4, Abstract	—

What To Try In 7 Days

Run your model on a small LiveMedBench snapshot or a time-split holdout to check for contamination and real-world drift.

Add a retrieval step (closed-book → open-book) and measure any immediate score change to estimate knowledge-obsolescence risk.

Replace any holistic auto-judging step with simple case-level rubrics for critical checks (safety, contraindications) and compare outcomes.
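The first two checks above can be run with very little code once you have per-case scores. The sketch below assumes you already hold a posting date and a rubric score for each case, plus a training cutoff date for your model; the function names and data shapes are hypothetical.

```python
from datetime import date
from statistics import mean


def time_split_gap(records: list[tuple[date, float]], cutoff: date) -> float:
    """Mean score on cases dated on or before the cutoff minus mean score after it.

    A clearly positive gap is consistent with contamination or drift,
    but it is not proof of either on its own.
    """
    before = [score for posted, score in records if posted <= cutoff]
    after = [score for posted, score in records if posted > cutoff]
    if not before or not after:
        raise ValueError("need cases on both sides of the cutoff")
    return mean(before) - mean(after)


def paired_delta(closed_book: dict[str, float], open_book: dict[str, float]) -> float:
    """Mean per-case score change from adding retrieval.

    Both dicts map case id to score; only cases present in both runs are used.
    """
    shared = closed_book.keys() & open_book.keys()
    if not shared:
        raise ValueError("no shared case ids")
    return mean(open_book[c] - closed_book[c] for c in shared)
```

Comparing the same cases in both runs (a paired comparison) removes case difficulty as a source of noise, which matters on a small snapshot.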

Agent Features

Planning

evidence-guided validation and veto rules, theme-guided rubric generation

Tool Use

Retrieval-Augmented Generation (RAG), MedCPT retriever, gpt-4.1 as automated grader, Qwen/Qwen3-4B-Instruct for pipeline agents

Frameworks

Multi-Agent Clinical Curation Framework, Automated Rubric-based Evaluation Framework

Is Agentic

Yes

Architectures

hierarchical multi-agent pipeline (Screener/Validator/Controller), automated Rubric Generator + Rubric-based Grader

Collaboration

multi-agent coordination with veto and audit steps, human-in-the-loop final quality assurance

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

LiveMedBench project page (paper states code and data available)

Data URLs

LiveMedBench project page (paper states code and data available)

Risks & Boundaries

Limitations

Sources are limited to four public English/Chinese communities and may carry demographic and practice-pattern biases.

Only text-only cases are included; multimodal cases (images, labs as images) are excluded.

When Not To Use

Do not use LiveMedBench as a clinical decision tool or for patient care.

Avoid using it when you need multimodal (image/video) clinical evaluation.

Failure Modes

Contextual Neglect & Integration Failure (models fail to tailor facts to patient specifics)

Guideline Overgeneralization (models apply guideline rules too rigidly to atypical cases)

Core Entities

Models

GPT-5.2, GPT-5.1, GPT-4.1, Grok-4.1, Baichuan-M3, GPT-OSS-120B, GLM-4.5, Qwen3-14B, Gemini-2.5-Pro, Med-Gemma-27B

Metrics

LiveMedBench score (normalized rubric sum), Macro F1 (criterion-level), Pearson correlation (case-level), Gwet's AC1 (human agreement)
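Two of these metrics are easy to reproduce when checking an automated grader against physician labels. The sketch below uses the standard two-rater, binary-label form of Gwet's AC1 and the textbook Pearson formula; how the paper pairs raters and aggregates across cases is not stated in this summary, so treat the setup as an assumption.

```python
from math import sqrt


def gwet_ac1(rater_a: list[int], rater_b: list[int]) -> float:
    """Gwet's AC1 for two raters assigning binary (0/1) labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Average share of positive labels across both raters.
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement for two categories; never exceeds 0.5, so no division by zero.
    chance = 2 * pi * (1 - pi)
    return (observed - chance) / (1 - chance)


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

AC1 is often preferred over Cohen's kappa when one label is much more common than the other, because kappa can drop sharply in that situation even when raw agreement is high.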

Datasets

LiveMedBench

Benchmarks

HealthBench, MedQA, MultiMedQA, DyReMe, MedPerturb, LiveBench

Context Entities

Models

Gemini-3-Pro, Claude-3.7-Sonnet, GLM-4, Qwen2.5-32B, DeepSeek-V3.2

Metrics

Accuracy

Datasets

HealthBench (comparison)

Benchmarks

MedArena, MedJourney
