PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Overview

Decision Snapshot: Needs Validation

The benchmark is practically useful and reproducible (code and data released). It is novel for pancreatic oncology, but limited by sample size and single-disease scope. Evidence for core claims is strong.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to deploy LLMs for patient-facing oncology guidance, test both completeness and factual accuracy on real patient questions. Rubric scores alone can mislead, and AI-only rubrics overestimate quality (by ~17.9 points versus human rubrics in this benchmark).

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

PanCanBench is a disease-specific benchmark built from 282 de-identified real pancreatic cancer patient questions and 3,130 question-specific rubric items crafted with oncology fellows. The authors evaluate 22 LLMs for clinical completeness, factual errors, and web-search integration. Scores vary widely (46.5%–82.3%); hallucination rates range from ~6% to ~54%. Web search did not reliably improve overall performance. AI-generated rubrics inflate scores by ~17.9 points versus human rubrics. The paper validates an LLM-as-judge pipeline (Cohen's κ≈0.53, F1=0.838) and releases code and rubrics for reuse.

Problem Statement

Standard LLM tests (multiple-choice, synthetic prompts) miss real clinical complexity and factual risk. We need a disease-focused, expert-grounded benchmark of real patient questions plus a factuality check to evaluate both completeness and hallucinations for safe patient-facing use.

Main Contribution

Created PanCanBench: 282 real patient questions with 3,130 question-specific rubric criteria.

Built a human-in-the-loop rubric pipeline combining oncology fellows and AI to produce validated rubrics.

Key Findings

Models' rubric-based completeness scores vary widely.

Numbers: Top score 82.3% (o3); range 46.5%–82.3%

Practical Use: Do not assume uniform quality; test each model on domain questions before deployment.

Evidence Ref: Section 3.3.1, Figure 4a

Factual errors (hallucinations) are common and differ by model.

Numbers: Error rates 6.0% (GPT-4o, Gemini-2.5 Pro) to 53.8% (Llama-3.1-8B)

Practical Use: Set explicit hallucination thresholds and filter or human-review outputs for clinical use.

Evidence Ref: Section 3.3.2, Figure 4b
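A hallucination threshold can be written down as a simple gate. The sketch below is illustrative only: the error rates are the ones reported above, while the 10% cutoff, the `review_policy` function, and the policy labels are hypothetical choices, not values from the paper.

```python
# Error rates (% of responses with >=1 factual error) as reported in the findings above.
REPORTED_ERROR_RATE_PCT = {
    "GPT-4o": 6.0,
    "Gemini-2.5 Pro": 6.0,
    "Llama-3.1-8B": 53.8,
}


def review_policy(error_rate_pct, max_rate_pct=10.0):
    """Map a measured hallucination rate to a deployment policy.

    max_rate_pct is a hypothetical threshold; pick yours with clinical input.
    Models under the threshold still go through human review here, because
    a low error rate is not a zero error rate.
    """
    if error_rate_pct > max_rate_pct:
        return "exclude"
    return "eligible_with_human_review"


policy = {
    model: review_policy(rate)
    for model, rate in REPORTED_ERROR_RATE_PCT.items()
}

for model, decision in policy.items():
    print(f"{model}: {decision}")
```

The cutoff belongs in configuration rather than code so that clinicians, not engineers, own the number.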

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Top rubric-based average score (model)	82.3% (o3)	—	—	Full PanCanBench	Section 3.3.1, Figure 4a	Section 3.3.1
Example model scores	Grok-4 80.4%; GPT-5 78.4%; Olmo3-32B-think 66.3%	—	—	Full PanCanBench	Section 3.3.1, Figure 4a	Section 3.3.1

What To Try In 7 Days

Run PanCanBench (public repo and dataset) on your model to get domain-specific completeness and hallucination metrics.

Validate any LLM-as-a-judge you use: compute Cohen's κ and F1 against clinician graders on a held-out subset.

Test your web-search integration: run the same questions with search on and with search off, then check whether search causes omissions or crowds out the model's internal knowledge.
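For the judge-validation step, a minimal sketch of both agreement metrics is shown here using only the standard library. It assumes binary per-rubric-item labels (1 = criterion met, 0 = not met); the label lists are made-up illustration data, and the paper's own evaluation code may differ.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def f1_score(truth, predicted, positive=1):
    """F1 of the judge's labels, treating the clinician labels as ground truth."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out subset: one label per (response, rubric item) pair.
clinician = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm_judge = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

kappa = cohens_kappa(clinician, llm_judge)
f1 = f1_score(clinician, llm_judge)
print(f"kappa={kappa:.3f}  F1={f1:.3f}")
```

Reporting both is useful because F1 can look high when most items are "met", while κ discounts agreement expected by chance.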

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/YiminZhao97/PanCanBench

Data URLs

https://huggingface.co/datasets/YiminZ07/PanCanBench

Risks & Boundaries

Limitations

Single disease focus (pancreatic cancer) limits generalization to other specialties

Moderate sample size (282 questions) may miss rarer clinical scenarios

When Not To Use

Do not generalize results to other diseases without re-running domain-specific rubrics

Do not rely solely on AI-generated rubrics for safety-critical decisions

Failure Modes

Hallucinations that assert incorrect staging or treatment facts

Omission of critical details when web search is enabled ('crowding out')
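The crowding-out failure mode can be checked with a paired comparison. This sketch assumes you have, for each question, the set of rubric item IDs a judge marked as met with search off and with search on; the function names, IDs, and data layout are hypothetical.

```python
def crowded_out_items(met_without_search, met_with_search):
    """Rubric items satisfied with search off but missed with search on."""
    return sorted(set(met_without_search) - set(met_with_search))


def crowding_out_report(runs):
    """Collect lost rubric items per question.

    runs maps question_id -> (items met with search off, items met with search on).
    Questions with no lost items are omitted from the report.
    """
    report = {}
    for question_id, (search_off, search_on) in runs.items():
        lost = crowded_out_items(search_off, search_on)
        if lost:
            report[question_id] = lost
    return report


# Illustrative data only: q1 loses item r2 when search is enabled.
runs = {
    "q1": ({"r1", "r2", "r3"}, {"r1", "r3", "r4"}),
    "q2": ({"r1"}, {"r1", "r2"}),
}

report = crowding_out_report(runs)
print(report)
```

Comparing item sets rather than aggregate scores matters here: a search-enabled run can gain one item and lose another, leaving the score unchanged while dropping a critical detail.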

Core Entities

Models

GPT-5, o3, Grok-4, Gemini-2.5 Pro, Gemini-2.5 Flash, GPT-4o, GPT-4.1, Llama-3.1-70B, Llama-3.1-8B, Olmo-3.1-32B-Instruct, Olmo-3-32b-think, Qwen-3-32B, Qwen-3-14B, Qwen-3-8B, Gemma-3-27B, Gemma-3-12B

Metrics

Rubric-based completeness (%), Hallucination rate (% responses with ≥1 factual error), Percentage of supportive links, Web search triggering rate (%), Web search resource appropriateness rate (%), Cohen's κ, F1 score (LLM-as-judge)
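The two headline metrics can be sketched as follows. The hallucination rate follows the definition above (share of responses with at least one factual error). The completeness score assumes equally weighted, binary rubric items averaged per question; that weighting is an assumption for illustration and may not match the paper's scoring.

```python
def completeness_score(items_met):
    """Percent of rubric items met for one response (equal weights assumed)."""
    if not items_met:
        raise ValueError("a question needs at least one rubric item")
    return 100.0 * sum(items_met) / len(items_met)


def mean_completeness(per_question_items):
    """Average the per-question completeness scores across the benchmark."""
    scores = [completeness_score(items) for items in per_question_items]
    return sum(scores) / len(scores)


def hallucination_rate(error_counts):
    """Percent of responses containing at least one factual error."""
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(count >= 1 for count in error_counts)
    return 100.0 * flagged / len(error_counts)


# Illustrative judge output: one list of met/not-met flags per question.
per_question_items = [
    [True, True, False, True],
    [True, False],
    [True, True, True],
]
# Illustrative factual-error counts, one per response.
error_counts = [0, 2, 0, 1, 0]

completeness = mean_completeness(per_question_items)
halluc = hallucination_rate(error_counts)
print(f"completeness={completeness:.1f}%  hallucination rate={halluc:.1f}%")
```

Note that a response with two errors counts once, so the rate measures how often a reader meets any error, not how many errors exist.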

Datasets

PanCanBench (282 questions, 3,130 rubric items)

Benchmarks

PanCanBench, HealthBench (comparison)

