Overview
Production Readiness: 0.6
Novelty Score: 0.5
Cost Impact Score: 0.6
Citation Count: 0
Why It Matters For Business
If you plan to deploy LLMs for patient-facing oncology guidance, test both completeness and factual accuracy on real patient questions. Rubric scores alone can mislead, and AI-only rubrics overestimate quality (by ~17.9 points versus human rubrics in this study).
Summary TLDR
PanCanBench is a disease-specific benchmark built from 282 de-identified real pancreatic cancer patient questions and 3,130 question-specific rubric items crafted with oncology fellows. The authors evaluate 22 LLMs for clinical completeness, factual errors, and web-search integration. Scores vary widely (46.5%–82.3%); hallucination rates range from ~6% to ~54%. Web search did not reliably improve overall performance. AI-generated rubrics inflate scores by ~17.9 points versus human rubrics. The paper validates an LLM-as-judge pipeline (Cohen's κ≈0.53, F1=0.838) and releases code and rubrics for reuse.
Problem Statement
Standard LLM tests (multiple-choice, synthetic prompts) miss real clinical complexity and factual risk. We need a disease-focused, expert-grounded benchmark of real patient questions plus a factuality check to evaluate both completeness and hallucinations for safe patient-facing use.
Main Contribution
Created PanCanBench: 282 real patient questions with 3,130 question-specific rubric criteria.
Built a human-in-the-loop rubric pipeline combining oncology fellows and AI to produce validated rubrics.
Evaluated 22 LLMs across rubric-based completeness, factual-error detection, and web-search integration.
Validated an LLM-as-a-judge factuality pipeline (dual-model cross-check) and compared AI- vs human-generated rubrics.
Key Findings
Models' rubric-based completeness scores vary widely.
Factual errors (hallucinations) are common and differ by model.
Web search did not reliably improve overall rubric scores.
AI-generated rubrics inflate scores versus human rubrics.
LLM-as-a-judge shows human-comparable agreement for grading.
Results
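Top rubric-based average score: 82.3%
Model score range: 46.5%–82.3% across 22 LLMs
Hallucination / factual-error rate by model: ~6% to ~54%
LLM-as-a-judge reliability: Cohen's κ≈0.53, F1=0.838
Web-search effect: no reliable improvement in overall rubric scores
AI-generated rubric inflation: ~17.9 points higher than human rubrics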
Who Should Care
What To Try In 7 Days
Run PanCanBench (public repo and dataset) on your model to get domain-specific completeness and hallucination metrics.
Validate any LLM-as-a-judge you use: compute Cohen's κ and F1 against clinician graders on a held-out subset.
Test your web-search integration: run the same questions with search on and off, and check whether enabling search causes omissions or crowds out the model's internal knowledge.
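For the judge-validation step, a minimal sketch of the two agreement metrics is shown here. It assumes binary labels (1 = rubric item met, 0 = not met) from a clinician and from the LLM judge on the same held-out items; the function names and the toy labels are illustrative and do not come from the PanCanBench repository.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(truth, predicted, positive=1):
    """F1 of the judge's labels, treating the clinician labels as ground truth."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, not from the benchmark).
clinician = [1, 1, 0, 0, 1, 0, 1, 1]
llm_judge = [1, 0, 0, 0, 1, 1, 1, 1]
print(cohens_kappa(clinician, llm_judge), f1_score(clinician, llm_judge))
```

Reporting both numbers is useful because F1 can look high when most rubric items are met, while κ discounts agreement that would occur by chance.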
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single disease focus (pancreatic cancer) limits generalization to other specialties
- Moderate sample size (282 questions) may miss rarer clinical scenarios
- Potential biases from LLM-based grading despite validation
- Binary rubric scoring can miss nuanced partial knowledge
When Not To Use
- Do not generalize results to other diseases without re-running domain-specific rubrics
- Do not rely solely on AI-generated rubrics for safety-critical decisions
- Avoid deploying models with high hallucination rates for unsupervised patient-facing use
Failure Modes
- Hallucinations that assert incorrect staging or treatment facts
- Omission of critical details when web search is enabled ('crowding out')
- Citing low-quality or non-peer-reviewed sources for clinical claims
- Brief responses that fail to meet breadth required by expert rubrics
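The "crowding out" failure mode above can be checked with a paired comparison of the same questions answered with search on and off. The sketch below assumes you already have, per question, the set of rubric item IDs judged as met in each condition and a completeness score for each; the helper names and IDs are hypothetical.

```python
def crowded_out_items(met_without_search, met_with_search):
    """Rubric item IDs met with search off but missed with search on."""
    return sorted(set(met_without_search) - set(met_with_search))


def mean_search_delta(scores_off, scores_on):
    """Mean paired difference in score (search on minus search off)."""
    if len(scores_off) != len(scores_on) or not scores_off:
        raise ValueError("score lists must be non-empty and equal length")
    diffs = [on - off for off, on in zip(scores_off, scores_on)]
    return sum(diffs) / len(diffs)


# Toy example (hypothetical IDs and scores).
print(crowded_out_items({"r1", "r2", "r3"}, {"r1", "r4"}))
print(mean_search_delta([60.0, 80.0], [70.0, 60.0]))
```

A negative mean delta, or a long list of crowded-out items on clinically important criteria, suggests that retrieval is displacing content the model would otherwise have covered.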
Core Entities
Models
- GPT-5
- o3
- Grok-4
- Gemini-2.5 Pro
- Gemini-2.5 Flash
- GPT-4o
- GPT-4.1
- Llama-3.1-70B
- Llama-3.1-8B
- Olmo-3.1-32B-Instruct
- Olmo-3-32b-think
- Qwen-3-32B
- Qwen-3-14B
- Qwen-3-8B
- Gemma-3-27B
- Gemma-3-12B
Metrics
- Rubric-based completeness (%)
- Hallucination rate (% responses with ≥1 factual error)
- Percentage of supportive links
- Web search triggering rate (%)
- Web search resource appropriateness rate (%)
- Cohen's κ
- F1 score (LLM-as-judge)
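The first two metrics above can be sketched as follows. This assumes unweighted binary rubric items and a simple mean over questions; the summary does not state whether the authors weight items or how they aggregate, so treat the aggregation as an assumption. Function names are illustrative.

```python
def completeness_score(rubric_results):
    """Percent of rubric items met for one response (binary, unweighted)."""
    if not rubric_results:
        raise ValueError("a response needs at least one rubric item")
    met = sum(bool(r) for r in rubric_results)
    return 100.0 * met / len(rubric_results)


def mean_completeness(per_question_results):
    """Average of per-question completeness scores (assumed aggregation)."""
    if not per_question_results:
        raise ValueError("need at least one question")
    scores = [completeness_score(r) for r in per_question_results]
    return sum(scores) / len(scores)


def hallucination_rate(error_counts):
    """Percent of responses containing at least one factual error."""
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(count >= 1 for count in error_counts)
    return 100.0 * flagged / len(error_counts)


# Toy data (hypothetical): three questions, four responses.
rubrics = [[True, True, False, True], [True, False], [True, True, True]]
print(mean_completeness(rubrics))
print(hallucination_rate([0, 2, 0, 1]))
```

Note that the hallucination rate counts responses, not individual errors, so a response with five errors contributes the same as a response with one.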
Datasets
- PanCanBench (282 questions, 3,130 rubric items)
Benchmarks
- PanCanBench
- HealthBench (comparison)

