A 7B cancer-specialized LLM that matches or beats larger models on phenotype extraction and diagnosis generation

Overview

Decision Snapshot: Needs Validation

The model demonstrates strong task-specific gains on a large single-site EHR corpus and shows practical GPU efficiency; however, private data, limited external validation, and sensitivity to noise reduce immediate deployability without site-specific testing.

Citations: 11

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 45%

Authors

Mingchen Li, Jiatan Huang, Jeremy Yeung, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang

Why It Matters For Business

CancerLLM shows that a domain-tuned 7B model can reach or exceed larger models on cancer tasks while using far less GPU memory, lowering operational cost for hospitals and clinics.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The authors built CancerLLM, a 7-billion-parameter Mistral-style model pre-trained on ~2.7M clinical notes and ~515K pathology reports from a single health system. After LoRA continued pre-training and instruction tuning on two new cancer-focused datasets (phenotype extraction and diagnosis generation), CancerLLM reaches 91.78% average F1 on phenotype extraction and 86.81% average F1 on diagnosis generation on their evaluation sets. The model is compact and GPU-efficient compared with larger medical LLMs. The paper also adds simple retrieval-augmented variants and two robustness testbeds (counterfactual labels and misspellings). Data and code are not released; training used private EHRs from

Problem Statement

General medical LLMs lack focused cancer knowledge and hospitals often cannot run very large models. There is also a shortage of cancer-specific, EHR-grounded datasets for evaluating extraction and diagnosis tasks. The paper aims to produce a smaller, cancer-adapted LLM and new evaluation sets to improve and test cancer-related extraction and generation.

Main Contribution

CancerLLM: a 7B Mistral-based LLM pre-trained on 2.68M cancer clinical notes and 515K pathology reports, then instruction-tuned for cancer tasks.

Two task datasets and evaluation setup: phenotype extraction (8 entity types) and diagnosis generation (ICD-aligned test set).

Key Findings

CancerLLM achieves the highest average F1 on diagnosis generation among the evaluated models.

Numbers: Diagnosis average F1 = 86.81% (Table 1)

Practical Use: A 7B domain-tuned model can produce diagnosis text of similar or better quality than much larger models, lowering deployment cost.

Evidence Ref: Table 1

CancerLLM reaches 91.78% average F1 on phenotype extraction on the authors' test set.

Numbers: Phenotype extraction average F1 = 91.78% (Table 2)

Practical Use: For structured phenotype extraction (tumor size, stage, receptors), a compact domain model performs at near top-tier levels for this dataset.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Diagnosis generation average F1 (Exact Match, BLEU-2, ROUGE-L)	86.81%	Best non-CancerLLM 7B/13B/70B baseline varied (e.g., Bio-Mistral 7B = 68.89%)	+17.92 pp vs Bio-Mistral 7B	Authors' diagnosis test set (ICDdiagnosis)	Table 1: CancerLLM 86.81% avg F1	Table 1
Phenotype extraction average F1 (Exact Match, BLEU-2, ROUGE-L)	91.78%	ClinicalCamel-70B = 93.72% (larger model)	-1.94 pp vs ClinicalCamel-70B	Authors' phenotype test set (CancerNER transformed)	Table 2: CancerLLM 91.78% avg F1	Table 2
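
The Delta column is a difference between two percentages, so it is best read as percentage points rather than relative change. A minimal Python check of that arithmetic, using only the figures from the table (the function and variable names are illustrative):

```python
# Average F1 scores (in percent) copied from the results table.
scores = {
    "diagnosis": {"CancerLLM": 86.81, "Bio-Mistral 7B": 68.89},
    "phenotype": {"CancerLLM": 91.78, "ClinicalCamel-70B": 93.72},
}


def delta_pp(task: str, baseline: str) -> float:
    """Return CancerLLM minus the baseline, in percentage points."""
    return round(scores[task]["CancerLLM"] - scores[task][baseline], 2)


diagnosis_delta = delta_pp("diagnosis", "Bio-Mistral 7B")
phenotype_delta = delta_pp("phenotype", "ClinicalCamel-70B")
print(f"{diagnosis_delta:+.2f} pp, {phenotype_delta:+.2f} pp")
```

The positive value is a gain over a same-size baseline; the negative value is a small gap to a model ten times larger.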

What To Try In 7 Days

Evaluate CancerLLM (or similar 7B domain model) on a small labeled internal dataset to check local performance.

Add a retrieval layer (Specter2 or Contriever) to see if diagnosis quality improves for your notes.

Run simple spelling normalization and abbreviation expansion before feeding clinical notes to reduce errors.
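
The preprocessing step in the last item could look roughly like the sketch below. It is a dictionary-based pass, not the authors' method, and both lookup tables are hypothetical placeholders; a real deployment would need a site-specific, clinician-reviewed list, since abbreviations such as "ca" are ambiguous in clinical text.

```python
import re

# Hypothetical abbreviation map, for illustration only.
ABBREVIATIONS = {
    "ca": "cancer",
    "mets": "metastases",
    "hx": "history",
    "dx": "diagnosis",
}

# Hypothetical misspelling map, for illustration only.
MISSPELLINGS = {
    "carcinma": "carcinoma",
    "metastatis": "metastasis",
}


def normalize_note(text: str) -> str:
    """Replace known misspellings and abbreviations, matching whole words case-insensitively."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in MISSPELLINGS:
            return MISSPELLINGS[key]
        if key in ABBREVIATIONS:
            return ABBREVIATIONS[key]
        return word  # leave unknown words untouched

    # Each alphabetic run is treated as one word; digits and punctuation pass through.
    return re.sub(r"[A-Za-z]+", replace, text)


print(normalize_note("Hx of breast ca with liver mets; dx: ductal carcinma."))
```

Running the same labeled notes through the model with and without this pass gives a quick read on how much noise matters for local data.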

Optimization Features

Token Efficiency

Task-specific max input/new token settings (1500/50 or 1500/500)

Model Optimization

LoRA

Start from Mistral-7B base

System Optimization

Parallelized pretraining across ten A100 GPUs

Training Optimization

LoRA
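
LoRA keeps a pretrained weight matrix of shape d x k frozen and trains only two low-rank factors, B (d x r) and A (r x k), so the trainable parameters per adapted matrix drop from d*k to r*(d + k). The sketch below shows that count. The rank and the set of adapted matrices are not given in this summary, so r = 8 and a square 4096 x 4096 projection are assumptions chosen for illustration.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


def full_params(d: int, k: int) -> int:
    """Parameters in the dense d x k weight matrix that LoRA keeps frozen."""
    return d * k


# Assumed values for illustration: a square 4096 x 4096 projection and rank 8.
d, k, r = 4096, 4096, 8
lora = lora_trainable_params(d, k, r)
full = full_params(d, k)
print(lora, full, f"{lora / full:.2%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, which is what makes continued pre-training on a large private corpus tractable.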

Inference Optimization

Small model size (7B) for single-A100 inference

Reduced max new token lengths per task to bound compute
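
A back-of-envelope estimate shows why model size matters for single-GPU inference. The sketch assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring the KV cache and activations. The summary does not state the precision or the A100 variant used; A100 cards ship with 40 GB or 80 GB of memory.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: 16-bit weights


def weight_memory_gb(params_billion: float,
                     bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


# Sizes correspond to the 7B, 13B, and 70B model classes compared in the summary.
for name, size in [("7B", 7), ("13B", 13), ("70B", 70)]:
    print(name, weight_memory_gb(size), "GB")
```

Under these assumptions a 7B model needs about 14 GB for weights and fits on either A100 variant, while a 70B model needs about 140 GB and has to be sharded or quantized.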

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Training data are private, single-institution EHRs; external generalization is untested.

Models are sensitive to misspellings and abbreviations in clinical notes.

When Not To Use

Where cross-site generalization is required without local fine-tuning.

On noisy, uncleaned clinical text without preprocessing.

Failure Modes

Incomplete generation that omits key clinical details.

Irrelevant or redundant outputs when context is ambiguous.

Core Entities

Models

CancerLLM-7B, Mistral-7B, Bio-Mistral-7B, ClinicalCamel-70B, Mixtral-8x7B, Llama2-13B, Qwen-7B

Metrics

Exact Match, BLEU-2, ROUGE-L, Average F1 (across the three)
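
For a local evaluation, two of these metrics are easy to approximate without extra dependencies. The sketch below implements Exact Match and a ROUGE-L F1 based on the longest common subsequence over whitespace tokens. It is a simplified stand-in, not the authors' evaluation code; official ROUGE implementations differ in tokenization and weighting, and BLEU-2 is left out.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(pred: str, ref: str) -> float:
    """Harmonic mean of LCS-based precision and recall over whitespace tokens."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    lcs = lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


pred = "invasive ductal carcinoma of left breast"
ref = "invasive ductal carcinoma left breast"
print(exact_match(pred, ref), round(rouge_l_f1(pred, ref), 3))
```

The example pair differs by one extra word, so Exact Match is 0 while ROUGE-L stays high; reporting both shows whether errors are near misses or wrong answers.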

Datasets

UMN clinical notes (2,676,642); UMN pathology reports (515,524); CancerNER (phenotype NER, transformed for QA); ICDdiagnosis test set (374 notes); Diagnosis generation training set (10,635 notes)

Benchmarks

Cancer phenotype extraction, Cancer diagnosis generation, Counterfactual robustness testbed, Misspelling robustness testbed
