A 7B cancer-specialized LLM that matches or beats larger models on phenotype extraction and diagnosis generation

June 15, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The model demonstrates strong task-specific gains on a large single-site EHR corpus and shows practical GPU efficiency; however, private data, limited external validation, and sensitivity to noise reduce immediate deployability without site-specific testing.

Citations: 11

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 45%

Authors

Mingchen Li, Jiatan Huang, Jeremy Yeung, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang

Why It Matters For Business

CancerLLM shows that a domain-tuned 7B model can reach or exceed larger models on cancer tasks while using far less GPU memory, lowering operational cost for hospitals and clinics.

Summary TLDR

The authors built CancerLLM, a 7-billion-parameter Mistral-style model pre-trained on ~2.7M clinical notes and ~515K pathology reports from a single health system. After LoRA continued pre-training and instruction tuning on two new cancer-focused datasets (phenotype extraction and diagnosis generation), CancerLLM reaches 91.78% average F1 on phenotype extraction and 86.81% average F1 on diagnosis generation on their evaluation sets. The model is compact and GPU-efficient compared with larger medical LLMs. The paper also adds simple retrieval-augmented variants and two robustness testbeds (counterfactual labels and misspellings). Data and code are not released; training used private EHRs from

Problem Statement

General medical LLMs lack focused cancer knowledge and hospitals often cannot run very large models. There is also a shortage of cancer-specific, EHR-grounded datasets for evaluating extraction and diagnosis tasks. The paper aims to produce a smaller, cancer-adapted LLM and new evaluation sets to improve and test cancer-related extraction and generation.

Main Contribution

CancerLLM: a 7B Mistral-based LLM pre-trained on 2.68M cancer clinical notes + 515K pathology reports and instruction-tuned for cancer tasks.

Two task datasets and evaluation setup: phenotype extraction (8 entity types) and diagnosis generation (ICD-aligned test set).

Key Findings

CancerLLM achieves state-of-the-art average F1 on diagnosis generation among evaluated models.

Numbers: Diagnosis average F1 = 86.81% (Table 1)

Practical Use: A 7B domain-tuned model can give clinically relevant diagnosis text quality similar to or better than much larger models, lowering deployment cost.

Evidence Ref: Table 1

CancerLLM reaches 91.78% average F1 on phenotype extraction on the authors' test set.

Numbers: Phenotype extraction average F1 = 91.78% (Table 2)

Practical Use: For structured phenotype extraction (tumor size, stage, receptors), a compact domain model performs at near top-tier levels for this dataset.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Diagnosis generation average F1 (Exact Match, BLEU-2, ROUGE-L) | 86.81% | Varies by baseline (7B/13B/70B); e.g., Bio-Mistral 7B = 68.89% | +17.92 points vs Bio-Mistral 7B | Authors' diagnosis test set (ICDdiagnosis) | Table 1: CancerLLM 86.81% avg F1 | Table 1
Phenotype extraction average F1 (Exact Match, BLEU-2, ROUGE-L) | 91.78% | ClinicalCamel-70B = 93.72% (larger model) | -1.94 points vs ClinicalCamel-70B | Authors' phenotype test set (CancerNER transformed) | Table 2: CancerLLM 91.78% avg F1 | Table 2
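The deltas in the table are differences between two percentages, so they are best read as percentage points rather than relative change. A minimal sketch that recomputes them from the scores reported above (the dictionary layout and function name are illustrative, not from the paper):

```python
# Average F1 scores (in percent) copied from the Results table above.
scores = {
    "diagnosis": {"CancerLLM": 86.81, "Bio-Mistral 7B": 68.89},
    "phenotype": {"CancerLLM": 91.78, "ClinicalCamel-70B": 93.72},
}


def delta_points(task: str, baseline: str) -> float:
    """Return CancerLLM minus the baseline, in percentage points."""
    task_scores = scores[task]
    return round(task_scores["CancerLLM"] - task_scores[baseline], 2)


diagnosis_delta = delta_points("diagnosis", "Bio-Mistral 7B")
phenotype_delta = delta_points("phenotype", "ClinicalCamel-70B")
print(f"diagnosis: {diagnosis_delta:+.2f} points")
print(f"phenotype: {phenotype_delta:+.2f} points")
```

Note the sign on the second row: on phenotype extraction the 70B baseline is still ahead of CancerLLM.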

What To Try In 7 Days

Evaluate CancerLLM (or similar 7B domain model) on a small labeled internal dataset to check local performance.

Add a retrieval layer (Specter2 or Contriever) to see if diagnosis quality improves for your notes.

Run simple spelling normalization and abbreviation expansion before feeding clinical notes to reduce errors.
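The preprocessing step in the last item could look roughly like the sketch below. It is only an illustration: the abbreviation table and vocabulary are hypothetical placeholders, the fuzzy matching uses Python's standard `difflib` rather than any method from the paper, and the output is lowercased and re-joined with spaces, which a real pipeline may not want.

```python
import difflib
import re

# Hypothetical lookup tables; a real deployment needs site-specific,
# clinically reviewed lists (e.g. "ca" can also mean calcium).
ABBREVIATIONS = {"ca": "cancer", "mets": "metastases", "dx": "diagnosis"}
VOCABULARY = ["carcinoma", "metastases", "diagnosis", "invasive", "ductal", "cancer"]


def normalize_note(text: str, cutoff: float = 0.85) -> str:
    """Expand known abbreviations and snap near-miss spellings to the vocabulary."""
    out = []
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isalpha() and len(token) > 4 and token not in VOCABULARY:
            # Only correct longer alphabetic tokens that closely match a known term.
            match = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
            out.append(match[0] if match else token)
        else:
            out.append(token)
    return " ".join(out)


cleaned = normalize_note("Dx: invasve ductal carcnoma")
print(cleaned)
```

A high `cutoff` keeps the corrector conservative, which matters in clinical text where a wrong "correction" can change meaning. Comparing model output on raw versus normalized notes for the same labeled sample gives a quick read on how much noise is costing locally.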

Optimization Features

Token Efficiency: Task-specific max input/new token settings (1500/50 or 1500/500)
Model Optimization: LoRA; start from Mistral-7B base
System Optimization: Parallelized pretraining across ten A100 GPUs
Training Optimization: LoRA
Inference Optimization: Small model size (7B) for single-A100 inference; reduced max new token lengths per task to bound compute
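As a rough sanity check on the single-A100 point, the weight footprint can be estimated from parameter count and precision. The sketch below assumes 16-bit weights and counts weights only (no KV cache, activations, or framework overhead); the precision and A100 variant the authors used are not stated in this summary. The LoRA part uses an illustrative 4096 x 4096 layer and rank 16, both hypothetical values chosen only to show the scale of the saving.

```python
GIB = 2**30  # bytes per gibibyte


def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / GIB


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains B (d_out x rank) and A (rank x d_in) instead of the full matrix."""
    return rank * (d_out + d_in)


mem_7b = weight_memory_gib(7e9)    # 7B parameters at 2 bytes each
mem_70b = weight_memory_gib(70e9)  # 70B parameters at 2 bytes each

full_layer = 4096 * 4096                             # hypothetical layer size
lora_layer = lora_trainable_params(4096, 4096, 16)   # hypothetical rank

print(f"7B weights:  {mem_7b:.1f} GiB")
print(f"70B weights: {mem_70b:.1f} GiB")
print(f"LoRA trains {lora_layer} of {full_layer} parameters in this layer")
```

Under these assumptions the 7B weights fit on one A100 (sold in 40 GB and 80 GB variants), while a 70B model at the same precision would need multiple GPUs or quantization.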

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Training data are private, single-institution EHRs; external generalization is untested.

Models are sensitive to misspellings and abbreviations in clinical notes.

When Not To Use

Where cross-site generalization is required without local fine-tuning.

On noisy, uncleaned clinical text without preprocessing.

Failure Modes

Incomplete generation that omits key clinical details.

Irrelevant or redundant outputs when context is ambiguous.

Core Entities

Models

CancerLLM-7B, Mistral-7B, Bio-Mistral-7B, ClinicalCamel-70B, Mixtral-8x7B, Llama2-13B, Qwen-7B

Metrics

Exact Match, BLEU-2, ROUGE-L, Average F1 (across the three)
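For a local evaluation, two of these metrics are easy to approximate with the standard library. The sketch below is an assumption about how they might be computed, not the authors' scoring code: exact match is case-insensitive string equality, ROUGE-L is the F1 form based on the longest common subsequence of whitespace tokens, and BLEU-2 is left out for brevity.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(pred: str, ref: str) -> float:
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall."""
    p_tokens, r_tokens = pred.lower().split(), ref.lower().split()
    if not p_tokens or not r_tokens:
        return 0.0
    lcs = lcs_length(p_tokens, r_tokens)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p_tokens), lcs / len(r_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "invasive ductal carcinoma"
print(exact_match("Invasive ductal carcinoma", reference))
print(rouge_l_f1("ductal carcinoma", reference))
```

Scores from a sketch like this are useful for comparing models against each other on the same local data, but they should not be compared directly with the paper's tables unless the tokenization and metric definitions are confirmed to match.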

Datasets

UMN clinical notes (2,676,642), UMN pathology reports (515,524), CancerNER (phenotype NER, transformed for QA), ICDdiagnosis test set (374 notes), Diagnosis generation training set (10,635 notes)

Benchmarks

Cancer phenotype extraction, Cancer diagnosis generation, Counterfactual robustness testbed, Misspelling robustness testbed
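This summary does not describe how the misspelling testbed was built. One simple way to run a similar stress test locally is to inject adjacent-character swaps into a fraction of words, as in the hypothetical sketch below (the function name, swap strategy, and default rate are all assumptions).

```python
import random


def add_misspellings(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap two adjacent inner characters in roughly `rate` of the longer words."""
    rng = random.Random(seed)  # seeded so the perturbed set is reproducible
    words = text.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            # Pick a swap position that leaves the first and last characters fixed.
            j = rng.randrange(1, len(word) - 2)
            chars = list(word)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
            words[i] = "".join(chars)
    return " ".join(words)


note = "patient with invasive ductal carcinoma of the left breast"
noisy = add_misspellings(note, rate=0.5, seed=42)
print(noisy)
```

Scoring the model on the same notes at several `rate` values shows how quickly quality degrades as noise increases, which helps decide how much preprocessing is needed before deployment.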