A weekly-updated, contamination-free medical benchmark plus automated rubrics that align better with physicians than LLM-as-a-judge

February 10, 2026 · 9 min

Overview

Decision Snapshot: Needs Validation

Benchmark and automated rubrics are well validated with physician agreement and statistical correlation; however, live updating and grader dependencies on specific API versions reduce immediate clinical production readiness.

Citations: 0

Evidence Strength: 0.82

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 35%

Novelty: 70%

Authors

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Static medical benchmarks can give falsely high performance because models may have seen test cases during training. LiveMedBench detects contamination and tests real-time generalization. If you deploy LLMs for clinical tasks, evaluate on time-separated, evidence-checked benchmarks and prefer retrieval-enabled systems.

Who Should Care

Summary TLDR

LiveMedBench is a continually refreshed medical benchmark that harvests real clinical Q&A threads, filters and verifies them with a multi-agent pipeline, and generates case-specific binary rubrics to score open-ended LLM answers. On 2,756 bilingual cases (16,702 criteria) the best model scores 39.2%. The automated rubric grader correlates with physicians much better than an LLM-as-a-judge and reveals that most failures are about applying facts to patient context, not raw knowledge gaps.
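The rubric-based scoring described above can be illustrated with a minimal sketch. It assumes the score is the unweighted fraction of binary criteria an answer satisfies, averaged over cases; the paper may weight criteria differently. The `Criterion` class, the function names, and the example criteria are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Criterion:
    text: str   # what the answer is expected to do (hypothetical wording)
    met: bool   # binary verdict from a grader, human or automated


def case_score(criteria: List[Criterion]) -> float:
    """Fraction of a case's binary criteria that the answer satisfies."""
    if not criteria:
        raise ValueError("a case needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def benchmark_score(cases: List[List[Criterion]]) -> float:
    """Unweighted mean of per-case scores (an assumption, not the paper's formula)."""
    if not cases:
        raise ValueError("need at least one case")
    return sum(case_score(c) for c in cases) / len(cases)


# Made-up example cases, for illustration only.
case_a = [
    Criterion("Asks about duration of symptoms", True),
    Criterion("Checks for drug allergies before recommending treatment", False),
    Criterion("Advises in-person evaluation for red-flag symptoms", True),
    Criterion("Tailors dosing advice to the patient's age", False),
]
case_b = [
    Criterion("Identifies the likely diagnosis", True),
    Criterion("Mentions the relevant contraindication", False),
]

print(case_score(case_a))                 # 0.5
print(benchmark_score([case_a, case_b]))  # 0.5
```

Because every criterion is a yes/no check tied to one case, a low score can be traced to the specific checks that failed, which a single holistic judge rating cannot offer.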

Problem Statement

Static medical benchmarks leak into model training data and age quickly, which inflates scores and hides real-world change. Common automatic metrics (lexical overlap) and LLM-as-a-judge approaches also verify clinical correctness poorly for open-ended medical advice.

Main Contribution

LiveMedBench: a weekly-updated, bilingual (English/Chinese) medical benchmark built from verified online clinical threads; current snapshot: 2,756 cases and 16,702 binary rubric criteria.

Multi-Agent Clinical Curation Framework: three-agent pipeline (Screener, Validator, Controller) that structures threads into narrative/query/advice, validates against medical evidence, and vetoes hallucinated details.
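The control flow of such a curation pipeline might look like the sketch below. It assumes the three roles run in sequence (structure, validate, veto) and replaces the LLM agents with toy rule-based stand-ins; `StructuredCase`, `curate`, and every `toy_*` function are hypothetical names, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StructuredCase:
    narrative: str  # patient's description of the situation
    query: str      # the question being asked
    advice: str     # the clinician's reply


def curate(
    thread: str,
    screen: Callable[[str], Optional[StructuredCase]],
    validate: Callable[[StructuredCase], bool],
    has_unsupported_details: Callable[[str, StructuredCase], bool],
) -> Optional[StructuredCase]:
    """Return a curated case, or None if any stage rejects the thread."""
    case = screen(thread)                      # Screener role: structure the thread
    if case is None:
        return None
    if not validate(case):                     # Validator role: check against evidence
        return None
    if has_unsupported_details(thread, case):  # Controller role: veto invented details
        return None
    return case


# Toy stand-ins; a real pipeline would call LLM agents and an evidence retriever.
def toy_screen(thread: str) -> Optional[StructuredCase]:
    parts = [p.strip() for p in thread.split("|")]
    if len(parts) != 3 or not all(parts):
        return None
    return StructuredCase(*parts)


def toy_validate(case: StructuredCase) -> bool:
    return "antibiotics for a cold" not in case.advice.lower()


def toy_has_unsupported_details(thread: str, case: StructuredCase) -> bool:
    # Veto if any structured field does not appear verbatim in the raw thread.
    return any(f not in thread for f in (case.narrative, case.query, case.advice))


thread = "Fever for 3 days | Should I see a doctor? | See a clinician if the fever persists"
print(curate(thread, toy_screen, toy_validate, toy_has_unsupported_details))
```

Placing the veto step last means a case is kept only if every earlier stage accepted it and nothing in the structured output goes beyond the source thread.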

Key Findings

LiveMedBench snapshot size and scope

Numbers: 2,756 cases; 16,702 rubric criteria; 38 specialties

Practical Use: Use this dataset when you need many real-world, case-level tests across specialties and languages rather than small static exam-style sets.

Evidence Ref: §3.4, Fig.3

Top model performance is low

Numbers: GPT-5.2 = 39.2% mean score on LiveMedBench

Practical Use: Expect current state-of-the-art LLMs to satisfy well under half of the rubric criteria on real clinical cases; don't trust high scores from older static benchmarks.

Evidence Ref: §4.2, Fig.4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 2,756 clinical cases | - | - | LiveMedBench | - | §3.4, Abstract
Total rubric criteria | 16,702 binary criteria | - | - | LiveMedBench | - | §3.4, Abstract

What To Try In 7 Days

Run your model on a small LiveMedBench snapshot or a time-split holdout to check for contamination and real-world drift.

Add a retrieval step (closed-book → open-book) and measure any immediate score change to estimate knowledge-obsolescence risk.

Replace any holistic auto-judging step with simple case-level rubrics for critical checks (safety, contraindications) and compare outcomes.
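The first two checks above reduce to simple score differences. The sketch below assumes you already have per-case scores in [0, 1] for closed-book and open-book runs plus a posting date per case; the record layout, field names, cutoff date, and all numbers are made up for illustration.

```python
from datetime import date
from statistics import mean
from typing import Dict, List, Tuple

Case = Dict[str, object]  # hypothetical record: posted, closed_book, open_book


def split_by_cutoff(cases: List[Case], cutoff: date) -> Tuple[List[Case], List[Case]]:
    """Split cases into those posted before and on/after the training cutoff."""
    before = [c for c in cases if c["posted"] < cutoff]
    after = [c for c in cases if c["posted"] >= cutoff]
    return before, after


def mean_score(cases: List[Case], key: str) -> float:
    return mean(float(c[key]) for c in cases)


def contamination_gap(cases: List[Case], cutoff: date, key: str = "closed_book") -> float:
    """Pre-cutoff mean minus post-cutoff mean; a large positive gap is a warning sign."""
    before, after = split_by_cutoff(cases, cutoff)
    if not before or not after:
        raise ValueError("need cases on both sides of the cutoff")
    return mean_score(before, key) - mean_score(after, key)


def retrieval_gain(cases: List[Case]) -> float:
    """Open-book mean minus closed-book mean on the same cases."""
    return mean_score(cases, "open_book") - mean_score(cases, "closed_book")


# Made-up scores, for illustration only.
cases = [
    {"posted": date(2024, 3, 1), "closed_book": 0.50, "open_book": 0.50},
    {"posted": date(2024, 9, 1), "closed_book": 0.50, "open_book": 0.75},
    {"posted": date(2025, 6, 1), "closed_book": 0.25, "open_book": 0.50},
    {"posted": date(2025, 11, 1), "closed_book": 0.25, "open_book": 0.25},
]
cutoff = date(2025, 1, 1)  # assumed training cutoff of the model under test

print(contamination_gap(cases, cutoff))  # 0.25
print(retrieval_gain(cases))             # 0.125
```

A gap alone does not prove contamination, since newer cases may simply be harder, so it is worth comparing the gap across several models with different cutoffs before drawing conclusions.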

Agent Features

Planning
evidence-guided validation and veto rules; theme-guided rubric generation
Tool Use
Retrieval-Augmented Generation (RAG); MedCPT retriever; gpt-4.1 as automated grader; Qwen/Qwen3-4B-Instruct for pipeline agents
Frameworks
Multi-Agent Clinical Curation Framework; Automated Rubric-based Evaluation Framework
Is Agentic

Yes

Architectures
hierarchical multi-agent pipeline (Screener/Validator/Controller); automated Rubric Generator + Rubric-based Grader
Collaboration
multi-agent coordination with veto and audit steps; human-in-the-loop final quality assurance

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

LiveMedBench project page (paper states code and data available)

Data URLs

LiveMedBench project page (paper states code and data available)

Risks & Boundaries

Limitations

Sources are limited to four public English/Chinese communities and may carry demographic and practice-pattern biases.

Only text cases are included; multimodal cases (images, labs as images) are excluded.

When Not To Use

Do not use LiveMedBench as a clinical decision tool or for patient care.

Avoid using it when you need multimodal (image/video) clinical evaluation.

Failure Modes

Contextual Neglect & Integration Failure (models fail to tailor facts to patient specifics)

Guideline Overgeneralization (models apply rules too rigidly to unique cases)

Core Entities

Models

GPT-5.2, GPT-5.1, GPT-4.1, Grok-4.1, Baichuan-M3, GPT-OSS-120B, GLM-4.5, Qwen3-14B, Gemini-2.5-Pro, Med-Gemma-27B

Metrics

LiveMedBench score (normalized rubric sum); Macro F1 (criterion-level); Pearson correlation (case-level); Gwet's AC1 (human agreement)
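Two of these agreement metrics can be sketched in plain Python. The example assumes two raters giving binary (0/1) verdicts per criterion for Gwet's AC1, and paired per-case scores for Pearson correlation; the function names and sample data are hypothetical, and the paper's exact aggregation may differ.

```python
import math
from typing import Sequence


def gwet_ac1(a: Sequence[int], b: Sequence[int]) -> float:
    """Gwet's AC1 for two raters and binary (0/1) labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n  # raw agreement
    pi = (sum(a) + sum(b)) / (2 * n)                  # mean prevalence of label 1
    chance = 2 * pi * (1 - pi)                        # chance agreement, at most 0.5
    return (observed - chance) / (1 - chance)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Made-up verdicts and scores, for illustration only.
physician = [1, 1, 1, 0]
grader = [1, 1, 0, 0]
print(gwet_ac1(physician, grader))         # 9/17, about 0.529
print(pearson([0.2, 0.4, 0.6], [0.1, 0.3, 0.5]))
```

AC1 is often preferred over Cohen's kappa when one label dominates, because its chance-agreement term stays bounded even under heavy class imbalance.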

Datasets

LiveMedBench

Benchmarks

HealthBench, MedQA, MultiMedQA, DyReMe, MedPerturb, LiveBench

Context Entities

Models

Gemini-3-Pro, Claude-3.7-Sonnet, GLM-4, Qwen2.5-32B, DeepSeek-V3.2

Metrics

Accuracy

Datasets

HealthBench (comparison)

Benchmarks

MedArena, MedJourney