A 7B cancer-specialized LLM that matches or beats larger models on phenotype extraction and diagnosis generation

June 15, 2024

8 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.75

Citation Count

11

Authors

Mingchen Li, Jiatan Huang, Jeremy Yeung, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang

Links

Abstract / PDF

Why It Matters For Business

CancerLLM shows that a domain-tuned 7B model can reach or exceed larger models on cancer tasks while using far less GPU memory, lowering operational cost for hospitals and clinics.

Summary TLDR

The authors built CancerLLM, a 7-billion-parameter Mistral-style model pre-trained on ~2.7M clinical notes and ~515K pathology reports from a single health system. After LoRA continued pre-training and instruction tuning on two new cancer-focused datasets (phenotype extraction and diagnosis generation), CancerLLM reaches 91.78% average F1 on phenotype extraction and 86.81% average F1 on diagnosis generation on their evaluation sets. The model is compact and GPU-efficient compared with larger medical LLMs. The paper also adds simple retrieval-augmented variants and two robustness testbeds (counterfactual labels and misspellings). Data and code are not released; training used private EHRs from

Problem Statement

General medical LLMs lack focused cancer knowledge, and hospitals often cannot run very large models. There is also a shortage of cancer-specific, EHR-grounded datasets for evaluating extraction and diagnosis tasks. The paper aims to produce a smaller, cancer-adapted LLM and new evaluation sets for building and testing cancer-related extraction and generation.

Main Contribution

CancerLLM: a 7B Mistral-based LLM pre-trained on 2.68M cancer clinical notes + 515K pathology reports and instruction-tuned for cancer tasks.

Two task datasets and evaluation setup: phenotype extraction (8 entity types) and diagnosis generation (ICD-aligned test set).

RAG variants with five retrievers and robustness testbeds for counterfactual labels and misspellings.
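The retrieval-augmented variants can be pictured as "find the most similar stored note, then prepend it to the prompt". The sketch below is only an illustration of that flow: it swaps the paper's dense retrievers for a toy bag-of-words cosine similarity, and the function names, prompt wording, and example notes are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense retriever (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(note, examples):
    """Prepend retrieved notes to the target note (hypothetical prompt format)."""
    context = "\n".join(f"Similar note: {e}" for e in examples)
    return f"{context}\nNote: {note}\nDiagnosis:"

# Made-up example notes, not taken from the paper's data.
corpus = [
    "biopsy shows invasive ductal carcinoma of the left breast",
    "ct scan shows a mass in the right upper lobe of the lung",
    "colonoscopy shows adenocarcinoma of the sigmoid colon",
]
query = "mass in left breast, biopsy pending"
top = retrieve(query, corpus, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real setup, `embed` would be replaced by the chosen retriever's encoder, and the corpus would be an index of training notes rather than an in-memory list.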

Key Findings

CancerLLM achieves the highest average F1 on diagnosis generation among the evaluated models.

Numbers: Diagnosis average F1 = 86.81% (Table 1)

CancerLLM reaches 91.78% average F1 on phenotype extraction on the authors' test set.

Numbers: Phenotype extraction average F1 = 91.78% (Table 2)

Pretraining used a large, private EHR corpus: ~2.68M clinical notes and ~515K pathology reports from one institution.

Numbers: 2,676,642 clinical notes; 515,524 pathology reports (Section 4.1)

Retrieval improves diagnosis generation; Specter2 gave the best retrieval-augmented results.

Numbers: Specter2 diagnosis F1 = 89.12% vs no-retriever 86.81% (Table 6)

Inference is faster and uses far less GPU memory than the largest baseline.

Numbers: Phenotype extraction inference took 1:14:12 and 5,550 MB of GPU memory; ClinicalCamel-70B took 2:50:16 and 37,716 MB (Table 5)

Performance drops sharply under noisy inputs: misspellings and high counterfactual label rates reduce accuracy.

Numbers: Misspelling average F1 ≈ 12% on the diagnosis task; counterfactual tests show F1 declining as the label error rate rises (Tables 3–4)

Results

Diagnosis generation average F1 (Exact Match, BLEU-2, ROUGE-L)

Value: 86.81%

Baseline: Best non-CancerLLM 7B/13B/70B baseline varied (e.g., Bio-Mistral 7B = 68.89%)

Phenotype extraction average F1 (Exact Match, BLEU-2, ROUGE-L)

Value: 91.78%

Baseline: ClinicalCamel-70B = 93.72% (larger model)

Retrieval-augmented diagnosis generation (best retriever)

Value: 89.12% average F1 (Specter2)

Baseline: No-retriever = 86.81%

Inference time and single-A100 memory (phenotype extraction)

Value: 1:14:12; 5,550 MB

Baseline: ClinicalCamel-70B: 2:50:16; 37,716 MB
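To put the reported figures in relative terms, the short calculation below converts them into ratios. It assumes the times are in h:mm:ss format, which the summary does not state explicitly; the variable names are illustrative.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds (format is an assumption)."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Figures as reported above for phenotype extraction inference.
cancerllm = {"time": "1:14:12", "gpu_mb": 5550}
clinicalcamel = {"time": "2:50:16", "gpu_mb": 37716}

speedup = to_seconds(clinicalcamel["time"]) / to_seconds(cancerllm["time"])
memory_ratio = clinicalcamel["gpu_mb"] / cancerllm["gpu_mb"]
rag_gain = 89.12 - 86.81  # Specter2 vs no-retriever, in F1 points

print(f"ClinicalCamel-70B takes {speedup:.2f}x as long")      # 2.29x
print(f"ClinicalCamel-70B uses {memory_ratio:.2f}x the memory")  # 6.80x
print(f"Retrieval adds {rag_gain:.2f} F1 points")              # 2.31
```

Under that reading, the 70B baseline needs roughly 2.3 times the wall-clock time and about 6.8 times the GPU memory, while gaining about 1.9 F1 points on phenotype extraction (93.72% vs 91.78%).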

Who Should Care

What To Try In 7 Days

Evaluate CancerLLM (or similar 7B domain model) on a small labeled internal dataset to check local performance.

Add a retrieval layer (Specter2 or Contriever) to see if diagnosis quality improves for your notes.

Run simple spelling normalization and abbreviation expansion before feeding clinical notes to reduce errors.
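A minimal sketch of the spelling-normalization and abbreviation-expansion step, using only the Python standard library. The abbreviation map, vocabulary, and similarity cutoff are hypothetical placeholders; a real deployment would need a clinically reviewed lexicon, since fuzzy matching can silently change the meaning of a term.

```python
import difflib
import re

# Hypothetical lookup tables; replace with a reviewed clinical lexicon.
ABBREVIATIONS = {"dx": "diagnosis", "hx": "history", "mets": "metastases"}
VOCABULARY = ["carcinoma", "metastases", "biopsy", "diagnosis", "history", "breast"]

def normalize_token(token):
    """Expand known abbreviations, then fuzzy-correct longer unknown words."""
    lower = token.lower()
    if lower in ABBREVIATIONS:
        return ABBREVIATIONS[lower]
    # Leave known words and short words alone to limit false corrections.
    if lower in VOCABULARY or len(lower) < 5:
        return token
    match = difflib.get_close_matches(lower, VOCABULARY, n=1, cutoff=0.8)
    return match[0] if match else token

def normalize_note(text):
    """Apply normalize_token to every alphabetic run, keeping punctuation."""
    return re.sub(r"[A-Za-z]+", lambda m: normalize_token(m.group(0)), text)

cleaned = normalize_note("Hx of breast carcinmoa, dx confirmed by biospy.")
print(cleaned)
```

Comparing model output on raw and normalized versions of the same notes gives a quick read on how much of the error comes from surface noise.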

Optimization Features

Token Efficiency

  • Task-specific limits on input tokens and newly generated tokens (1500/50 or 1500/500)

Model Optimization

  • LoRA
  • Start from Mistral-7B base

System Optimization

  • Parallelized pretraining across ten A100 GPUs

Training Optimization

  • LoRA
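For readers unfamiliar with LoRA, the sketch below shows the general idea in plain Python: the base weight matrix stays frozen and only a low-rank product is trained. The dimensions, rank, and scaling value are illustrative assumptions, not the settings used for CancerLLM, which this summary does not report.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(A, B, alpha, r):
    """Low-rank update (alpha / r) * B @ A added to the frozen weight."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one LoRA adapter: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

# Illustrative sizes only (assumption).
d_out, d_in, r, alpha = 8, 8, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                               # trainable, zero-initialized

delta = lora_delta(A, B, alpha, r)
W_eff = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# With B initialized to zero, the adapted layer starts identical to the base layer.
print(W_eff == W)
# For an illustrative 4096 x 4096 layer at rank 16:
print(lora_param_count(4096, 4096, 16), "trainable vs", 4096 * 4096, "frozen")
```

The parameter count is what makes continued pre-training on a fixed GPU budget practical: the adapter in the last line is under 1% of the size of the matrix it modifies.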

Inference Optimization

  • Small model size (7B) for single-A100 inference
  • Reduced max new token lengths per task to bound compute

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Training data are private, single-institution EHRs; external generalization is untested.
  • Models are sensitive to misspellings and abbreviations in clinical notes.
  • Annotation quality affects phenotype extraction; noisy labels degrade performance.
  • Code and datasets are not released, limiting reproducibility.

When Not To Use

  • Where cross-site generalization is required without local fine-tuning.
  • On noisy, uncleaned clinical text without preprocessing.
  • When public, auditable datasets or open-source models are required.

Failure Modes

  • Incomplete generation that omits key clinical details.
  • Irrelevant or redundant outputs when context is ambiguous.
  • Errors from misspellings and abbreviation misinterpretation.
  • Degraded outputs under high label-noise during training.

Core Entities

Models

  • CancerLLM-7B
  • Mistral-7B
  • Bio-Mistral-7B
  • ClinicalCamel-70B
  • Mixtral-8x7B
  • Llama2-13B
  • Qwen-7B

Metrics

  • Exact Match
  • BLEU-2
  • ROUGE-L
  • Average F1 (across the three)
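As a rough guide to what two of these metrics measure, here is a minimal sketch of exact match and an LCS-based ROUGE-L F1 over whitespace tokens. The tokenization and lowercasing are assumptions, the paper's exact scoring scripts may differ, and BLEU-2 is omitted for brevity.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(pred, ref):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    p, r = pred.lower().split(), ref.lower().split()
    lcs = lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred, ref):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())

# Made-up prediction/reference pair for illustration.
pred = "malignant neoplasm of breast"
ref = "malignant neoplasm of left breast"
print(exact_match(pred, ref), round(rouge_l_f1(pred, ref), 4))
```

The example pair shows why the metrics are reported together: a near-miss that drops one word scores zero on exact match but still earns substantial ROUGE-L credit.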

Datasets

  • UMN clinical notes (2,676,642)
  • UMN pathology reports (515,524)
  • CancerNER (phenotype NER, transformed for QA)
  • ICDdiagnosis test set (374 notes)
  • Diagnosis generation training set (10,635 notes)

Benchmarks

  • Cancer phenotype extraction
  • Cancer diagnosis generation
  • Counterfactual robustness testbed
  • Misspelling robustness testbed
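The two robustness testbeds can be approximated locally with simple noise injection. The sketch below is one possible construction, not the paper's procedure: it swaps adjacent inner letters to simulate misspellings and replaces a fraction of labels with a different label to simulate counterfactual annotations. All function names and sample data are hypothetical.

```python
import random

def inject_misspellings(text, rate, rng):
    """Swap two adjacent inner letters in roughly `rate` of the eligible words."""
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and w.isalpha() and rng.random() < rate:
            j = rng.randrange(1, len(w) - 2)  # keep first and last letters fixed
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def corrupt_labels(examples, rate, label_pool, rng):
    """Replace the label of roughly `rate` of the examples with a different one."""
    out = []
    for text, label in examples:
        if rng.random() < rate:
            label = rng.choice([l for l in label_pool if l != label])
        out.append((text, label))
    return out

# Made-up data for illustration.
rng = random.Random(0)
note = "invasive ductal carcinoma"
noisy = inject_misspellings(note, 1.0, rng)

examples = [("note a", "breast cancer"), ("note b", "lung cancer")]
labels = ["breast cancer", "lung cancer", "colon cancer"]
flipped = corrupt_labels(examples, 1.0, labels, rng)
print(noisy)
print(flipped)
```

Sweeping `rate` from 0 to 1 and re-scoring the model at each step yields a degradation curve comparable in spirit to the trends reported for these testbeds.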