PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

March 2, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek

Why It Matters For Business

If you plan to deploy LLMs for patient-facing oncology guidance, you must test both completeness and factual accuracy on real patient questions; rubric scores alone can be misleading and AI-only rubrics overestimate quality.

Summary TLDR

PanCanBench is a disease-specific benchmark built from 282 de-identified real pancreatic cancer patient questions and 3,130 question-specific rubric items crafted with oncology fellows. The authors evaluate 22 LLMs for clinical completeness, factual errors, and web-search integration. Scores vary widely (46.5%–82.3%); hallucination rates range from ~6% to ~54%. Web search did not reliably improve overall performance. AI-generated rubrics inflate scores by ~17.9 points versus human rubrics. The paper validates an LLM-as-judge pipeline (Cohen's κ≈0.53, F1=0.838) and releases code and rubrics for reuse.

Problem Statement

Standard LLM tests (multiple-choice, synthetic prompts) miss real clinical complexity and factual risk. We need a disease-focused, expert-grounded benchmark of real patient questions plus a factuality check to evaluate both completeness and hallucinations for safe patient-facing use.

Main Contribution

Created PanCanBench: 282 real patient questions with 3,130 question-specific rubric criteria.

Built a human-in-the-loop rubric pipeline combining oncology fellows and AI to produce validated rubrics.

Evaluated 22 LLMs across rubric-based completeness, factual-error detection, and web-search integration.

Validated an LLM-as-a-judge factuality pipeline (dual-model cross-check) and compared AI- vs human-generated rubrics.

Key Findings

Models' rubric-based completeness scores vary widely.

Numbers: Top score 82.3% (o3); range 46.5%–82.3%

Factual errors (hallucinations) are common and differ by model.

Numbers: Error rates 6.0% (GPT-4o, Gemini-2.5 Pro) to 53.8% (Llama-3.1-8B)

Web search did not reliably improve overall rubric scores.

Numbers: Gemini-2.5 Pro 66.8% → 63.9%; GPT-5 73.8% → 72.8%

AI-generated rubrics inflate scores versus human rubrics.

Numbers: Average +17.9 points; top model near 95.2% under AI rubrics

LLM-as-a-judge shows human-comparable agreement for grading.

Numbers: LLM-human Cohen's κ = 0.528; human-human κ = 0.518; LLM F1 = 0.838

Results

Top rubric-based average score (model)

Value: 82.3% (o3)

Example model scores

Value: Grok-4 80.4%; GPT-5 78.4%; Olmo3-32B-think 66.3%

Hallucination / factual-error rate by model

Value: Range 6.0% (GPT-4o, Gemini-2.5 Pro) to 53.8% (Llama-3.1-8B)

LLM-as-a-judge reliability

Value: Cohen's κ = 0.528; LLM F1 = 0.838

Baseline: Human-human κ = 0.518; human graders F1 = 0.855

Web-search effect (examples)

Value: Gemini-2.5 Pro 66.8% → 63.9% (-2.9); GPT-5 73.8% → 72.8% (-1.0)

Baseline: No-search condition
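
The web-search deltas reported here are plain differences between the search-on and search-off rubric scores. A minimal sketch of that arithmetic follows, using only the two model pairs quoted in this summary; the dictionary layout and the `search_delta` helper are illustrative assumptions, not part of the PanCanBench code.

```python
# Scores quoted in this summary: (no-search %, with-search %).
# The data structure and helper name are hypothetical.
reported_scores = {
    "Gemini-2.5 Pro": (66.8, 63.9),
    "GPT-5": (73.8, 72.8),
}


def search_delta(no_search: float, with_search: float) -> float:
    """Change in rubric score (percentage points) when web search is enabled."""
    return round(with_search - no_search, 1)


deltas = {
    model: search_delta(no_search, with_search)
    for model, (no_search, with_search) in reported_scores.items()
}

for model, delta in deltas.items():
    direction = "worse" if delta < 0 else "better or equal"
    print(f"{model}: {delta:+.1f} points with search ({direction})")
```

A negative delta means the model scored lower with search enabled, which is the pattern both quoted models show.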

AI-generated rubric inflation

Value: Average +17.9 points; Grok-4 reached 95.2% under AI rubrics

Baseline: Human-curated rubric scores

Who Should Care

What To Try In 7 Days

Run PanCanBench (public repo and dataset) on your model to get domain-specific completeness and hallucination metrics.

Validate any LLM-as-a-judge you use: compute Cohen's κ and F1 against clinician graders on a held-out subset.

Test your web-search integration: compare with search off to check for omission or crowding out of internal knowledge.
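
To validate an LLM judge against clinician graders, agreement can be measured with Cohen's κ and F1 on binary pass/fail rubric judgments. The sketch below is a dependency-free illustration under that assumption; the function names and the sample labels are hypothetical and are not taken from the PanCanBench repository.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(reference, predicted, positive=1):
    """F1 of `predicted` against `reference` for the `positive` label."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out subset: 1 = rubric item met, 0 = not met.
clinician_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohens_kappa(clinician_labels, judge_labels)
f1 = f1_score(clinician_labels, judge_labels)
print(f"kappa={kappa:.3f}, F1={f1:.3f}")
```

Computing the same κ between two clinicians on that subset gives a human-human baseline to compare the judge against, as the paper does with its 0.528 versus 0.518 figures.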

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single disease focus (pancreatic cancer) limits generalization to other specialties
  • Moderate sample size (282 questions) may miss rarer clinical scenarios
  • Potential biases from LLM-based grading despite validation
  • Binary rubric scoring can miss nuanced partial knowledge

When Not To Use

  • Do not generalize results to other diseases without re-running domain-specific rubrics
  • Do not rely solely on AI-generated rubrics for safety-critical decisions
  • Avoid deploying models with high hallucination rates for unsupervised patient-facing use

Failure Modes

  • Hallucinations that assert incorrect staging or treatment facts
  • Omission of critical details when web search is enabled ('crowding out')
  • Citing low-quality or non-peer-reviewed sources for clinical claims
  • Brief responses that fail to meet breadth required by expert rubrics

Core Entities

Models

  • GPT-5
  • o3
  • Grok-4
  • Gemini-2.5 Pro
  • Gemini-2.5 Flash
  • GPT-4o
  • GPT-4.1
  • Llama-3.1-70B
  • Llama-3.1-8B
  • Olmo-3.1-32B-Instruct
  • Olmo-3-32b-think
  • Qwen-3-32B
  • Qwen-3-14B
  • Qwen-3-8B
  • Gemma-3-27B
  • Gemma-3-12B

Metrics

  • Rubric-based completeness (%)
  • Hallucination rate (% of responses with ≥1 factual error)
  • Percentage of supportive links
  • Web search triggering rate (%)
  • Web search resource appropriateness rate (%)
  • Cohen's κ
  • F1 score (LLM-as-judge)
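
The first two metrics above can be sketched in a few lines. This illustration assumes binary, unweighted rubric items, a per-question score equal to the fraction of items met, and a model score equal to the mean over questions; the paper's exact aggregation may differ. The function names and sample data are hypothetical.

```python
def completeness_pct(rubric_results):
    """Mean per-question fraction of rubric items met, as a percentage.

    rubric_results: one list of booleans per question (True = item met).
    Assumes unweighted binary items; this is an illustrative aggregation.
    """
    if not rubric_results or any(len(items) == 0 for items in rubric_results):
        raise ValueError("every question needs at least one rubric item")
    per_question = [sum(items) / len(items) for items in rubric_results]
    return 100.0 * sum(per_question) / len(per_question)


def hallucination_rate_pct(error_counts):
    """Percentage of responses containing at least one factual error.

    error_counts: number of factual errors flagged in each response.
    """
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(count >= 1 for count in error_counts)
    return 100.0 * flagged / len(error_counts)


# Hypothetical grading output for three questions and four responses.
example_rubrics = [
    [True, True, False, True],  # 3 of 4 items met
    [True, False],              # 1 of 2 items met
    [True, True, True],         # 3 of 3 items met
]
example_errors = [0, 2, 0, 1]

completeness = completeness_pct(example_rubrics)
hallucination = hallucination_rate_pct(example_errors)
print(f"completeness={completeness:.1f}%, hallucination rate={hallucination:.1f}%")
```

Because the hallucination rate counts responses rather than individual errors, a response with two errors weighs the same as one with a single error.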

Datasets

  • PanCanBench (282 questions, 3,130 rubric items)

Benchmarks

  • PanCanBench
  • HealthBench (comparison)