A weekly-updated, contamination-free medical benchmark plus automated rubrics that align better with physicians than LLM-as-a-judge

February 10, 2026 · 9 min

Overview

Production Readiness

0.35

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

Links

Abstract / PDF

Why It Matters For Business

Static medical benchmarks can give falsely high performance because models may have seen test cases during training. LiveMedBench detects contamination and tests real-time generalization. If you deploy LLMs for clinical tasks, evaluate on time-separated, evidence-checked benchmarks and prefer retrieval-enabled systems.

Summary TLDR

LiveMedBench is a continually refreshed medical benchmark that harvests real clinical Q&A threads, filters and verifies them with a multi-agent pipeline, and generates case-specific binary rubrics to score open-ended LLM answers. On 2,756 bilingual cases (16,702 criteria), the best model, GPT-5.2, scores 39.2%. The automated rubric grader correlates with physician judgments much better than an LLM-as-a-judge (Pearson ρ = 0.54 vs. 0.26), and error analysis shows that the most common failure is applying facts to patient context, not raw knowledge gaps.

Problem Statement

Static medical benchmarks leak into model training data and age quickly. Leakage inflates scores, and aging means the benchmark misses real-world change. In addition, common automatic metrics (lexical overlap) and LLM-as-a-judge approaches verify the clinical correctness of open-ended medical advice poorly.

Main Contribution

LiveMedBench: a weekly-updated, bilingual (English/Chinese) medical benchmark built from verified online clinical threads; current snapshot: 2,756 cases and 16,702 binary rubric criteria.

Multi-Agent Clinical Curation Framework: three-agent pipeline (Screener, Validator, Controller) that structures threads into narrative/query/advice, validates against medical evidence, and vetoes hallucinated details.

Automated Rubric-based Evaluation Framework: converts physician replies into case-specific, weighted binary criteria and uses an automated grader to score model outputs.

Extensive evaluation of 38 LLMs (proprietary and open-source), plus human validation showing that the rubric grader aligns better with physician judgments than LLM-as-a-judge.
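The rubric-based scoring described above can be sketched as a weighted fraction of criteria met. This is a minimal illustration, not the paper's implementation: the `Criterion` class, its field names, the example criteria, and the assumption that all weights are positive are hypothetical. The summary only states that criteria are binary, weighted, and case-specific, and that the score is a normalized rubric sum.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str      # what the answer is expected to contain (hypothetical example text)
    weight: float  # importance weight; assumed positive in this sketch
    met: bool      # binary verdict produced by the grader


def rubric_score(criteria: list[Criterion]) -> float:
    """Return the weighted fraction of criteria met, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("criteria must carry positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


# Toy case with three made-up criteria.
case = [
    Criterion("Asks about current medications", 2.0, True),
    Criterion("Flags the relevant contraindication", 3.0, False),
    Criterion("Recommends in-person follow-up", 1.0, True),
]
print(f"{rubric_score(case):.2f}")  # 3.0 earned out of 6.0 -> 0.50
```

Scoring against explicit per-case criteria keeps each verdict auditable: a low score can be traced to the specific criteria that were missed.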

Key Findings

LiveMedBench snapshot size and scope

Numbers: 2,756 cases; 16,702 rubric criteria; 38 specialties

Top model performance is low

Numbers: GPT-5.2 = 39.2% mean score on LiveMedBench

Widespread performance drop on post-cutoff cases

Numbers: 84% of models (32/38) degrade on post-cutoff cases

Automated rubric grader aligns with physicians

Numbers: Macro F1 = 0.76; Pearson ρ = 0.54 vs. human consensus

Human validation shows high dataset quality

Numbers: Gwet's AC1 ≥ 0.8914 for generated components and criteria

Dominant failure mode is contextual application

Numbers: Contextual Neglect/Integration Failure = 35–48% of errors

Retrieval (open-book) partially recovers performance

Numbers: Open-book scores exceed closed-book scores across sampled models (e.g., Baichuan-M3 +0.67 pts)

Results

Dataset size

Value: 2,756 clinical cases

Total rubric criteria

Value: 16,702 binary criteria

Languages

Value: 1,568 English; 1,188 Chinese

Average criteria per case

Value: 6.06 (range 2–19)

Best model score

Value: 39.2%

Models degrading on post-cutoff cases

Value: 84% (32/38 models)

Rubric-based Grader alignment (criterion-level)

Value: Macro F1 = 0.76

Baseline: Human inter-rater Macro F1 = 0.89

Rubric-based Grader case-level correlation

Value: Pearson ρ = 0.54 (p < 1e-4)

Baseline: LLM-as-a-Judge ρ = 0.26 (p = 0.07)

Human agreement on narrative & advice

Value: Gwet's AC1 ≥ 0.9566

Who Should Care

What To Try In 7 Days

Run your model on a small LiveMedBench snapshot or a time-split holdout to check for contamination and real-world drift.

Add a retrieval step (closed-book → open-book) and measure any immediate score change to estimate knowledge-obsolescence risk.

Replace any holistic auto-judging step with simple case-level rubrics for critical checks (safety, contraindications) and compare outcomes.
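The time-split check in the first item can be sketched as follows, assuming each evaluated case carries the date its thread was posted and a per-case score in [0, 1]. The function names, the tuple layout, the toy data, and the cutoff date are all hypothetical; substitute your model's actual training cutoff and your own scores.

```python
from datetime import date
from statistics import mean


def split_by_cutoff(cases, cutoff):
    """Partition (posted_on, score) pairs into pre- and post-cutoff score lists."""
    pre = [score for posted_on, score in cases if posted_on <= cutoff]
    post = [score for posted_on, score in cases if posted_on > cutoff]
    return pre, post


def cutoff_gap(cases, cutoff):
    """Mean post-cutoff score minus mean pre-cutoff score (negative means degradation)."""
    pre, post = split_by_cutoff(cases, cutoff)
    if not pre or not post:
        raise ValueError("need at least one case on each side of the cutoff")
    return mean(post) - mean(pre)


# Toy data: (date the thread was posted, score for the model's answer).
cases = [
    (date(2024, 3, 1), 0.50),
    (date(2024, 9, 15), 0.40),
    (date(2025, 6, 1), 0.30),
    (date(2025, 11, 20), 0.20),
]
gap = cutoff_gap(cases, cutoff=date(2024, 12, 31))
print(f"{gap:+.2f}")  # -0.20: the toy model scores lower on post-cutoff cases
```

A clearly negative gap is consistent with contamination or knowledge drift, but on small samples it can also be noise, so compare it against the spread of scores within each split before drawing conclusions.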

Agent Features

Planning

  • evidence-guided validation and veto rules
  • theme-guided rubric generation

Tool Use

  • Retrieval-Augmented Generation (RAG)
  • MedCPT retriever
  • gpt-4.1 as automated grader
  • Qwen/Qwen3-4B-Instruct for pipeline agents

Frameworks

  • Multi-Agent Clinical Curation Framework
  • Automated Rubric-based Evaluation Framework

Is Agentic

true

Architectures

  • hierarchical multi-agent pipeline (Screener/Validator/Controller)
  • automated Rubric Generator + Rubric-based Grader

Collaboration

  • multi-agent coordination with veto and audit steps
  • human-in-the-loop final quality assurance

Reproducibility

Code Urls

  • LiveMedBench project page (paper states code and data available)

Data Urls

  • LiveMedBench project page (paper states code and data available)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Sources are limited to four public English/Chinese communities and may carry demographic and practice-pattern biases.
  • Only text-only cases are included; multimodal cases (images, labs as images) are excluded.
  • Weekly live updates require careful snapshotting for reproducible comparisons.
  • Automated components depend on specific external models and APIs (grader, retriever), which may change over time.

When Not To Use

  • Do not use LiveMedBench as a clinical decision tool or for patient care.
  • Avoid using it when you need multimodal (image/video) clinical evaluation.
  • Not suitable without adaptation if your deployment language is neither English nor Chinese.

Failure Modes

  • Contextual Neglect & Integration Failure (models fail to tailor facts to patient specifics)
  • Guideline Overgeneralization (models apply rules too rigidly to unique cases)
  • Knowledge obsolescence / data contamination (models memorized older cases)
  • Grader mismatch if pipeline models/APIs drift (evaluation instability)

Core Entities

Models

  • GPT-5.2
  • GPT-5.1
  • GPT-4.1
  • Grok-4.1
  • Baichuan-M3
  • GPT-OSS-120B
  • GLM-4.5
  • Qwen3-14B
  • Gemini-2.5-Pro
  • Med-Gemma-27B

Metrics

  • LiveMedBench score (normalized rubric sum)
  • Macro F1 (criterion-level)
  • Pearson correlation (case-level)
  • Gwet's AC1 (human agreement)
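The two grader-alignment metrics listed above can be computed with the standard library alone. The sketch below assumes criterion-level verdicts are booleans (met / not met) and case-level scores are floats; the sample data and variable names are invented for illustration and are not taken from the paper.

```python
from math import sqrt


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the two verdict classes."""
    return (f1_for_class(y_true, y_pred, True) + f1_for_class(y_true, y_pred, False)) / 2


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Criterion-level verdicts: physician consensus vs. automated grader (toy data).
human = [True, True, False, False, True, False]
grader = [True, False, False, False, True, True]
print(f"{macro_f1(human, grader):.2f}")  # 0.67

# Case-level scores: physician vs. automated grader (toy data).
physician_scores = [0.2, 0.4, 0.6, 0.8]
grader_scores = [0.1, 0.5, 0.4, 0.9]
print(f"{pearson(physician_scores, grader_scores):.2f}")  # 0.90
```

Macro F1 weights the "met" and "not met" classes equally, which matters when most criteria fall into one class; Pearson correlation then checks whether the aggregated case scores track physician scores.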

Datasets

  • LiveMedBench

Benchmarks

  • HealthBench
  • MedQA
  • MultiMedQA
  • DyReMe
  • MedPerturb
  • LiveBench

Context Entities

Models

  • Gemini-3-Pro
  • Claude-3.7-Sonnet
  • GLM-4
  • Qwen2.5-32B
  • DeepSeek-V3.2

Metrics

  • Accuracy

Datasets

  • HealthBench (comparison)

Benchmarks

  • MedArena
  • MedJourney