Overview
Production Readiness
0.35
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Static medical benchmarks can give falsely high performance because models may have seen test cases during training. LiveMedBench detects contamination and tests real-time generalization. If you deploy LLMs for clinical tasks, evaluate on time-separated, evidence-checked benchmarks and prefer retrieval-enabled systems.
Summary TLDR
LiveMedBench is a continually refreshed medical benchmark that harvests real clinical Q&A threads, filters and verifies them with a multi-agent pipeline, and generates case-specific binary rubrics to score open-ended LLM answers. On 2,756 bilingual cases (16,702 criteria) the best model scores 39.2%. The automated rubric grader correlates with physicians much better than an LLM-as-a-judge and reveals that most failures are about applying facts to patient context, not raw knowledge gaps.
Problem Statement
Static medical benchmarks leak into model training data and age quickly, which inflates scores and fails to reflect changes in real-world practice. Common automatic metrics (lexical overlap) and LLM-as-a-judge approaches also do a poor job of verifying the clinical correctness of open-ended medical advice.
Main Contribution
LiveMedBench: a weekly-updated, bilingual (English/Chinese) medical benchmark built from verified online clinical threads; current snapshot: 2,756 cases and 16,702 binary rubric criteria.
Multi-Agent Clinical Curation Framework: three-agent pipeline (Screener, Validator, Controller) that structures threads into narrative/query/advice, validates against medical evidence, and vetoes hallucinated details.
Automated Rubric-based Evaluation Framework: converts physician replies into case-specific, weighted binary criteria and uses an automated grader to score model outputs.
Extensive evaluation of 38 LLMs (proprietary and open-source), plus human validation showing that the rubric grader aligns better with physician judgments than an LLM-as-a-judge.
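To make the rubric-based scoring idea concrete, here is a minimal Python sketch of how weighted binary criteria could be turned into a normalized case score. It is an illustration only, not the paper's implementation: the `Criterion` class, the `rubric_score` and `benchmark_score` functions, and the assumption that all weights are positive are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One case-specific rubric item (hypothetical structure)."""
    text: str
    weight: float  # assumption: positive importance weight
    met: bool      # binary verdict from the grader


def rubric_score(criteria):
    """Weight of satisfied criteria divided by total weight, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


def benchmark_score(cases):
    """Unweighted mean of per-case rubric scores (assumed aggregation)."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(rubric_score(case) for case in cases) / len(cases)


example_case = [
    Criterion("Asks about current medications", 3.0, True),
    Criterion("Flags the contraindication", 2.0, False),
    Criterion("Recommends follow-up", 1.0, True),
]
example_score = rubric_score(example_case)  # 4.0 earned out of 6.0
```

Normalizing per case keeps cases with many criteria from dominating the aggregate, which is one plausible reading of "normalized rubric sum".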
Key Findings
LiveMedBench snapshot size and scope
Top model performance is low
Widespread performance drop on post-cutoff cases
Automated rubric grader aligns with physicians
Human validation shows high dataset quality
Dominant failure mode is contextual application
Retrieval (open-book) partially recovers performance
Results
Dataset size
Total rubric criteria
Languages
Average criteria per case
Best model score
Models degrading on post-cutoff cases
Rubric-based Grader alignment (criterion-level)
Rubric-based Grader case-level correlation
Human agreement on narrative & advice
Who Should Care
What To Try In 7 Days
Run your model on a small LiveMedBench snapshot or a time-split holdout to check for contamination and real-world drift.
Add a retrieval step (closed-book → open-book) and measure any immediate score change to estimate knowledge-obsolescence risk.
Replace any holistic auto-judging step with simple case-level rubrics for critical checks (safety, contraindications) and compare outcomes.
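The time-split check from the first step above can be sketched in a few lines of Python. This is a hedged illustration: the `(case_date, score)` tuple layout, the function names, and the cutoff date are assumptions for the example, and the scores are made-up placeholders rather than benchmark results.

```python
from datetime import date
from statistics import mean


def split_by_cutoff(results, cutoff):
    """Split (case_date, score) pairs into pre- and post-cutoff score lists."""
    pre = [score for case_date, score in results if case_date <= cutoff]
    post = [score for case_date, score in results if case_date > cutoff]
    return pre, post


def post_cutoff_drop(results, cutoff):
    """Mean pre-cutoff score minus mean post-cutoff score.

    A clearly positive value is consistent with contamination or
    knowledge obsolescence, but it is not proof of either.
    """
    pre, post = split_by_cutoff(results, cutoff)
    if not pre or not post:
        raise ValueError("need at least one case on each side of the cutoff")
    return mean(pre) - mean(post)


# Placeholder data: (date the case was posted, per-case rubric score).
results = [
    (date(2024, 1, 1), 0.5),
    (date(2024, 6, 1), 0.7),
    (date(2025, 3, 1), 0.4),
    (date(2025, 4, 1), 0.2),
]
drop = post_cutoff_drop(results, cutoff=date(2024, 12, 31))
```

The same helper can be rerun on closed-book and open-book outputs to see how much of the drop retrieval recovers.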
Agent Features
Planning
- evidence-guided validation and veto rules
- theme-guided rubric generation
Tool Use
- Retrieval-Augmented Generation (RAG)
- MedCPT retriever
- gpt-4.1 as automated grader
- Qwen/Qwen3-4B-Instruct for pipeline agents
Frameworks
- Multi-Agent Clinical Curation Framework
- Automated Rubric-based Evaluation Framework
Is Agentic
true
Architectures
- hierarchical multi-agent pipeline (Screener/Validator/Controller)
- automated Rubric Generator + Rubric-based Grader
Collaboration
- multi-agent coordination with veto and audit steps
- human-in-the-loop final quality assurance
Reproducibility
Code Urls
- LiveMedBench project page (paper states code and data available)
Data Urls
- LiveMedBench project page (paper states code and data available)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Sources are limited to four public English/Chinese communities and may carry demographic and practice-pattern biases.
- Only text-only cases are included; multimodal cases (images, labs as images) are excluded.
- Weekly live updates require careful snapshotting for reproducible comparisons.
- Automated components (grader, retriever) depend on specific external models and APIs that may change over time.
When Not To Use
- Do not use LiveMedBench as a clinical decision tool or for patient care.
- Avoid using it when you need multimodal (image/video) clinical evaluation.
- Not suitable without adaptation if your deployment language is neither English nor Chinese.
Failure Modes
- Contextual Neglect & Integration Failure (models fail to tailor facts to patient specifics)
- Guideline Overgeneralization (models apply rules too rigidly to unique cases)
- Knowledge obsolescence / data contamination (models memorized older cases)
- Grader mismatch if pipeline models/APIs drift (evaluation instability)
Core Entities
Models
- GPT-5.2
- GPT-5.1
- GPT-4.1
- Grok-4.1
- Baichuan-M3
- GPT-OSS-120B
- GLM-4.5
- Qwen3-14B
- Gemini-2.5-Pro
- Med-Gemma-27B
Metrics
- LiveMedBench score (normalized rubric sum)
- Macro F1 (criterion-level)
- Pearson correlation (case-level)
- Gwet's AC1 (human agreement)
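Of the metrics above, Gwet's AC1 is the least familiar, so a small Python sketch may help. It uses the standard two-rater, two-category form, AC1 = (p_a - p_e) / (1 - p_e) with p_e = 2π(1 - π), where π is the average share of positive ratings across both raters. The function name and the 0/1 label encoding are assumptions for this example, and the paper's exact rater setup may differ.

```python
def gwet_ac1_binary(rater_a, rater_b):
    """Gwet's AC1 for two raters giving binary (0/1) labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Average prevalence of the positive label across both raters.
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement under Gwet's model; at most 0.5 for two categories,
    # so the denominator below is never zero.
    p_e = 2 * pi * (1 - pi)
    return (p_a - p_e) / (1 - p_e)


# Placeholder labels: 1 = "acceptable", 0 = "not acceptable".
physician_1 = [1, 1, 1, 1, 0]
physician_2 = [1, 1, 1, 0, 0]
ac1 = gwet_ac1_binary(physician_1, physician_2)
```

Unlike Cohen's kappa, AC1 stays stable when one label is much more common than the other, which is a common reason to prefer it for quality checks where most items pass.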
Datasets
- LiveMedBench
Benchmarks
- HealthBench
- MedQA
- MultiMedQA
- DyReMe
- MedPerturb
- LiveBench
Context Entities
Models
- Gemini-3-Pro
- Claude-3.7-Sonnet
- GLM-4
- Qwen2.5-32B
- DeepSeek-V3.2
Metrics
- Accuracy
Datasets
- HealthBench (comparison)
Benchmarks
- MedArena
- MedJourney

