Overview
The benchmark is practically useful and reproducible (code and data released). It is novel for pancreatic oncology, but limited by sample size and single-disease scope. Evidence for core claims is strong.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
If you plan to deploy LLMs for patient-facing oncology guidance, test both completeness and factual accuracy on real patient questions. Rubric scores alone can mislead, and AI-only rubrics overestimate quality (by about 17.9 points versus human rubrics in this study).
Who Should Care
Summary TLDR
PanCanBench is a disease-specific benchmark built from 282 de-identified real pancreatic cancer patient questions and 3,130 question-specific rubric items crafted with oncology fellows. The authors evaluate 22 LLMs for clinical completeness, factual errors, and web-search integration. Scores vary widely (46.5%–82.3%); hallucination rates range from ~6% to ~54%. Web search did not reliably improve overall performance. AI-generated rubrics inflate scores by ~17.9 points versus human rubrics. The paper validates an LLM-as-judge pipeline (Cohen's κ≈0.53, F1=0.838) and releases code and rubrics for reuse.
Problem Statement
Standard LLM tests (multiple-choice, synthetic prompts) miss real clinical complexity and factual risk. We need a disease-focused, expert-grounded benchmark of real patient questions plus a factuality check to evaluate both completeness and hallucinations for safe patient-facing use.
Main Contribution
Created PanCanBench: 282 real patient questions with 3,130 question-specific rubric criteria.
Built a human-in-the-loop rubric pipeline combining oncology fellows and AI to produce validated rubrics.
Key Findings
Models' rubric-based completeness scores vary widely, from 46.5% to 82.3% across the 22 evaluated LLMs.
Factual errors (hallucinations) are common and differ by model, with rates ranging from ~6% to ~54%.
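To make the two headline metrics concrete, here is a minimal sketch of how a completeness score and a hallucination rate could be aggregated. It assumes an unweighted rubric (each item counts equally) and a per-response boolean error flag; the paper's exact scoring formula is not given in this summary, and the function and field names are hypothetical.

```python
from statistics import mean


def completeness_score(rubric_hits):
    """Percent of rubric items a response satisfies (unweighted: an assumption)."""
    if not rubric_hits:
        raise ValueError("a question needs at least one rubric item")
    return 100.0 * sum(rubric_hits) / len(rubric_hits)


def benchmark_summary(responses):
    """Aggregate per-response results.

    responses: list of dicts with hypothetical keys
      'rubric_hits'       -> list of bool, one per rubric item
      'has_factual_error' -> bool, True if any factual error was found
    """
    if not responses:
        raise ValueError("no responses to summarize")
    scores = [completeness_score(r["rubric_hits"]) for r in responses]
    error_rate = 100.0 * sum(r["has_factual_error"] for r in responses) / len(responses)
    return {"mean_completeness": mean(scores), "hallucination_rate": error_rate}


# Toy data, not taken from the benchmark.
toy = [
    {"rubric_hits": [True, True, False, True], "has_factual_error": False},
    {"rubric_hits": [True, False], "has_factual_error": True},
]
summary = benchmark_summary(toy)
print(summary)
```

Reporting the two numbers side by side matters because a response can cover most rubric items and still contain a factual error.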
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Top rubric-based average score (model) | 82.3% (o3) | — | — | Full PanCanBench | Section 3.3.1, Figure 4a | Section 3.3.1 |
| Example model scores | Grok-4 80.4%; GPT-5 78.4%; Olmo3-32B-think 66.3% | — | — | Full PanCanBench | Section 3.3.1, Figure 4a | Section 3.3.1 |
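Most rows above carry no explicit delta. As a small illustration, the gap between each listed model and the top score can be derived from the table values alone; these gaps are simple subtractions of the figures shown here, not numbers reported by the paper.

```python
# Scores copied from the Results table (rubric-based average, in percent).
scores = {
    "o3": 82.3,
    "Grok-4": 80.4,
    "GPT-5": 78.4,
    "Olmo3-32B-think": 66.3,
}

top_model = max(scores, key=scores.get)

# Gap to the top model in percentage points, rounded to one decimal.
gaps = {model: round(scores[top_model] - s, 1) for model, s in scores.items()}

for model, gap in gaps.items():
    print(f"{model}: {scores[model]:.1f}% ({gap:.1f} points below {top_model})")
```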
What To Try In 7 Days
Run PanCanBench (public repo and dataset) on your model to get domain-specific completeness and hallucination metrics.
Validate any LLM-as-a-judge you use: compute Cohen's κ and F1 against clinician graders on a held-out subset.
Test your web-search integration: compare runs with search on and search off to check for omissions or crowding out of internal knowledge.
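The judge-validation step above can be sketched with the standard library alone. This assumes binary per-rubric-item labels (1 = criterion met, 0 = not met) from a clinician and from the LLM judge on the same held-out items; the label encoding and the toy data are assumptions, not details from the paper.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


def f1_score(truth, predicted, positive=1):
    """F1 for the positive class, treating `truth` as the reference labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not taken from the benchmark.
clinician = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 0, 0, 0, 1, 1, 1, 1]

kappa = cohen_kappa(clinician, judge)
f1 = f1_score(clinician, judge)
print(f"kappa={kappa:.3f}, F1={f1:.3f}")
```

Kappa and F1 answer different questions: kappa discounts agreement that would occur by chance, while F1 focuses on how well the judge recovers the clinician's positive labels, so it is worth reporting both.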
Risks & Boundaries
Limitations
Single disease focus (pancreatic cancer) limits generalization to other specialties
Moderate sample size (282 questions) may miss rarer clinical scenarios
When Not To Use
Do not generalize results to other diseases without re-running domain-specific rubrics
Do not rely solely on AI-generated rubrics for safety-critical decisions
Failure Modes
Hallucinations that assert incorrect staging or treatment facts
Omission of critical details when web search is enabled ('crowding out')
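One way to look for the crowding-out failure mode is a paired comparison of per-question scores with web search off and on. The sketch below is illustrative only: the function name, the score dictionaries, and the 10-point drop threshold are assumptions, not part of the paper's method.

```python
def compare_search(scores_off, scores_on, drop_threshold=10.0):
    """Compare per-question scores with web search off vs. on.

    scores_off / scores_on: dicts mapping question id -> completeness score (percent).
    Returns the mean change (on minus off) and the ids of questions whose
    score dropped by at least `drop_threshold` points when search was enabled.
    """
    shared = sorted(set(scores_off) & set(scores_on))
    if not shared:
        raise ValueError("no questions were scored in both conditions")
    deltas = {q: scores_on[q] - scores_off[q] for q in shared}
    mean_delta = sum(deltas.values()) / len(deltas)
    regressions = [q for q in shared if deltas[q] <= -drop_threshold]
    return mean_delta, regressions


# Toy scores, not taken from the benchmark.
off = {"q1": 80.0, "q2": 70.0, "q3": 60.0}
on = {"q1": 85.0, "q2": 50.0, "q3": 60.0}

mean_delta, regressions = compare_search(off, on)
print(f"mean change with search on: {mean_delta:+.1f} points")
print("questions to review for omitted details:", regressions)
```

Flagged questions are candidates for manual review; a score drop alone does not show which critical detail was omitted.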

