PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

March 2, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practically useful and reproducible (code and data released). It is novel for pancreatic oncology, but limited by sample size and single-disease scope. Evidence for core claims is strong.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to deploy LLMs for patient-facing oncology guidance, you must test both completeness and factual accuracy on real patient questions; rubric scores alone can be misleading and AI-only rubrics overestimate quality.

Who Should Care

Summary TLDR

PanCanBench is a disease-specific benchmark built from 282 de-identified real pancreatic cancer patient questions and 3,130 question-specific rubric items crafted with oncology fellows. The authors evaluate 22 LLMs for clinical completeness, factual errors, and web-search integration. Scores vary widely (46.5%–82.3%); hallucination rates range from ~6% to ~54%. Web search did not reliably improve overall performance. AI-generated rubrics inflate scores by ~17.9 points versus human rubrics. The paper validates an LLM-as-judge pipeline (Cohen's κ≈0.53, F1=0.838) and releases code and rubrics for reuse.

Problem Statement

Standard LLM tests (multiple-choice, synthetic prompts) miss real clinical complexity and factual risk. We need a disease-focused, expert-grounded benchmark of real patient questions plus a factuality check to evaluate both completeness and hallucinations for safe patient-facing use.

Main Contribution

Created PanCanBench: 282 real patient questions with 3,130 question-specific rubric criteria.

Built a human-in-the-loop rubric pipeline combining oncology fellows and AI to produce validated rubrics.
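To make the rubric structure concrete, here is a minimal sketch of how graded rubric items for one question could be turned into a completeness percentage. It assumes an unweighted score (the fraction of criteria a grader marks as met); the summary does not state the paper's exact aggregation, and the names `RubricItem` and `completeness_score`, as well as the sample criteria, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One question-specific criterion (hypothetical schema)."""
    criterion: str
    met: bool  # grader's verdict for a given model response


def completeness_score(items: list[RubricItem]) -> float:
    """Percentage of rubric criteria met. Assumes equal weights."""
    if not items:
        raise ValueError("rubric must contain at least one item")
    return 100.0 * sum(item.met for item in items) / len(items)


# Illustrative rubric for a single question; the criteria are made up.
rubric = [
    RubricItem("Explains what the staging result means", True),
    RubricItem("Mentions discussing options with the care team", True),
    RubricItem("Notes relevant clinical trial resources", False),
    RubricItem("Avoids giving a definitive prognosis", True),
]
score = completeness_score(rubric)
print(f"{score:.1f}%")
```

Averaging this per-question score over all questions would give a model-level figure comparable in spirit to the percentages reported here, under the equal-weight assumption.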

Key Findings

Models' rubric-based completeness scores vary widely.

Numbers: Top score 82.3% (o3); range 46.5%–82.3%

Practical Use: Do not assume uniform quality; test each model on domain questions before deployment.

Evidence Ref: Section 3.3.1, Figure 4a

Factual errors (hallucinations) are common and differ by model.

Numbers: Error rates 6.0% (GPT-4o, Gemini-2.5 Pro) to 53.8% (Llama-3.1-8B)

Practical Use: Set explicit hallucination thresholds and filter or human-review outputs for clinical use.

Evidence Ref: Section 3.3.2, Figure 4b
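The hallucination metric is described as the share of responses with at least one factual error. A minimal sketch of that calculation and a threshold check follows. The function names, the sample error counts, and the 10% threshold are illustrative assumptions, not values from the paper.

```python
def hallucination_rate(error_counts: list[int]) -> float:
    """Percent of responses containing at least one factual error."""
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(1 for n in error_counts if n >= 1)
    return 100.0 * flagged / len(error_counts)


def passes_threshold(rate: float, max_rate: float) -> bool:
    """True when the observed rate is at or below the allowed maximum."""
    return rate <= max_rate


# Made-up per-response error counts from a factuality review.
counts = [0, 0, 2, 0, 1, 0, 0, 0, 0, 0]
rate = hallucination_rate(counts)

# The 10% ceiling is a hypothetical deployment policy.
ok = passes_threshold(rate, max_rate=10.0)
print(rate, ok)
```

Counting responses rather than individual errors means a response with two errors weighs the same as one with a single error, which matches the metric definition listed under Core Entities.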

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Top rubric-based average score (model) | 82.3% (o3) | | | Full PanCanBench | Section 3.3.1, Figure 4a | Section 3.3.1
Example model scores | Grok-4 80.4%; GPT-5 78.4%; Olmo3-32B-think 66.3% | | | Full PanCanBench | Section 3.3.1, Figure 4a | Section 3.3.1

What To Try In 7 Days

Run PanCanBench (public repo and dataset) on your model to get domain-specific completeness and hallucination metrics.

Validate any LLM-as-a-judge you use: compute Cohen's κ and F1 against clinician graders on a held-out subset.

Test your web-search integration: compare with search off to check for omission or crowding out of internal knowledge.
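For the judge-validation step, the sketch below computes Cohen's κ and F1 for binary rubric verdicts (1 = criterion met, 0 = not met), treating the clinician labels as ground truth. It is a plain-Python illustration under that binary-label assumption; the label lists are made up and the function names are hypothetical.

```python
def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two raters assigning binary labels (0 or 1)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0.
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (observed - expected) / (1 - expected)


def f1_score(truth: list[int], pred: list[int]) -> float:
    """F1 of the positive class, with `truth` as the reference labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up verdicts on a held-out subset of rubric items.
clinician = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

kappa = cohen_kappa(clinician, judge)
f1 = f1_score(clinician, judge)
print(f"kappa={kappa:.3f} f1={f1:.3f}")
```

Reporting both numbers is useful because F1 can look high when most criteria are met, while κ corrects for the agreement expected by chance.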

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single disease focus (pancreatic cancer) limits generalization to other specialties

Moderate sample size (282 questions) may miss rarer clinical scenarios

When Not To Use

Do not generalize results to other diseases without re-running domain-specific rubrics

Do not rely solely on AI-generated rubrics for safety-critical decisions

Failure Modes

Hallucinations that assert incorrect staging or treatment facts

Omission of critical details when web search is enabled ('crowding out')
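One way to look for the web-search "crowding out" failure mode is a paired comparison of per-question completeness scores with search off and on. The sketch below assumes you already have such scores keyed by question ID; the function name `search_ablation`, the IDs, and the numbers are hypothetical.

```python
from statistics import mean


def search_ablation(off: dict[str, float], on: dict[str, float]) -> dict:
    """Compare per-question scores with web search off vs. on."""
    shared = sorted(off.keys() & on.keys())
    if not shared:
        raise ValueError("no questions in common between the two runs")
    deltas = {q: on[q] - off[q] for q in shared}
    return {
        # Positive means search helped on average.
        "mean_delta": mean(list(deltas.values())),
        # Questions where enabling search lowered the score.
        "regressed": [q for q in shared if deltas[q] < 0],
    }


# Made-up completeness scores (%) for three questions.
off_scores = {"q1": 80.0, "q2": 60.0, "q3": 70.0}
on_scores = {"q1": 70.0, "q2": 75.0, "q3": 70.0}

report = search_ablation(off_scores, on_scores)
print(report)
```

Listing the regressed questions, not just the mean, matters here: an average near zero can hide individual answers where search displaced details the model would otherwise have included.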

Core Entities

Models

GPT-5, o3, Grok-4, Gemini-2.5 Pro, Gemini-2.5 Flash, GPT-4o, GPT-4.1, Llama-3.1-70B, Llama-3.1-8B, Olmo-3.1-32B-Instruct, Olmo-3-32b-think, Qwen-3-32B, Qwen-3-14B, Qwen-3-8B, Gemma-3-27B, Gemma-3-12B

Metrics

Rubric-based completeness (%); Hallucination rate (% responses with ≥1 factual error); Percentage of supportive links; Web search triggering rate (%); Web search resource appropriateness rate (%); Cohen's κ; F1 score (LLM-as-judge)

Datasets

PanCanBench (282 questions, 3,130 rubric items)

Benchmarks

PanCanBench, HealthBench (comparison)