Biomedical LLMs often underperform general models on unseen clinical data

August 25, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.35

Cost Impact Score

0.3

Citation Count

5

Authors

Felix J. Dorfner, Amin Dada, Felix Busch, Marcus R. Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Jacqueline Lammert, Lisa C. Adams, Keno K. Bressem

Links

Abstract / PDF

Why It Matters For Business

Fine-tuning on public biomedical text does not reliably boost performance on new clinical tasks and can reduce reliability; use large general models or retrieval systems for production clinical features.

Summary TLDR

The authors compared many biomedical fine-tuned LLMs with their general-purpose counterparts on multiple "unseen" clinical tasks: NEJM and JAMA case vignettes, plus CLUE tasks such as MeDiSumQA, MeDiSumCode, MedNLI, and LongHealth. Generalist models (notably Llama-3-70B-Instruct) matched or beat biomedical models on most tasks, and smaller biomedical models often performed much worse. Biomedical fine-tuning sometimes increased hallucinations and reduced general knowledge. The paper argues that off-the-shelf fine-tuning on public biomedical text does not reliably improve real-world clinical performance and suggests retrieval-augmented approaches instead.

Problem Statement

Researchers commonly fine-tune LLMs on biomedical text to boost clinical performance. But it is unclear whether that fine-tuning actually helps on truly new clinical data. This paper tests whether biomedical LLMs generalize better than large generalist models on recent, likely-unseen clinical benchmarks.

Main Contribution

A broad side-by-side evaluation of multiple biomedical and generalist LLMs across recent unseen clinical datasets and CLUE benchmark tasks.

Empirical finding that general-purpose models (especially Llama-3-70B-Instruct) match or outperform biomedical fine-tuned models on many clinical tasks.

Evidence that smaller biomedical models often underperform and that biomedical fine-tuning can increase hallucination risk on long-document tasks.

Practical recommendation to consider retrieval-augmented generation (RAG) or careful fine-tuning strategies instead of naive continued pretraining.
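
As a rough illustration of the retrieval step that a RAG pipeline adds in front of a general model, the sketch below ranks passages by word overlap with the question and builds a grounded prompt. It is not the paper's method: the function names, the overlap scoring, the prompt wording, and the toy passages are all assumptions made for this example, and a real system would typically use a proper retriever (BM25 or embeddings) over de-identified clinical text.

```python
import re
from collections import Counter

# Toy passages invented for this example; not taken from any dataset.
PASSAGES = [
    "Discharge note: patient started on metformin 500 mg twice daily.",
    "Radiology: chest x-ray shows no acute findings.",
    "Follow-up: blood pressure well controlled on current therapy.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count tokens shared by query and passage (multiset intersection)."""
    return sum((Counter(tokenize(query)) & Counter(tokenize(passage))).values())


def retrieve(query, passages, k=2):
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


QUESTION = "What metformin dose was the patient started on?"

if __name__ == "__main__":
    print(build_prompt(QUESTION, PASSAGES, k=1))
```

The prompt produced by `build_prompt` would then be sent to whichever general model is under evaluation; that call is left out because it depends on the serving stack.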

Key Findings

Generalist models often match or outperform biomedical fine-tuned models on unseen clinical case vignettes; on JAMA the biomedical 70B model leads by only 1.4 points.

Numbers: JAMA: OpenBioLLM-70B 66.4% vs Llama-3-70B-Instruct 65%

Smaller biomedical models can perform much worse than similarly sized generalist models.

Numbers: NEJM: OpenBioLLM-8B 30% vs Llama-3-8B-Instruct 64.3%

Large generalist models strongly outperform biomedical variants on medical coding for discharge summaries.

Numbers: MeDiSumCode Valid Code Acc: Llama-3-70B-Instruct 93.94% vs OpenBioLLM-70B 73.65%

Biomedical fine-tuning can increase hallucination risk on long clinical documents.

Numbers: LongHealth Task 3: OpenBioLLM-8B 1.55 vs Llama-3-70B-Instruct 91.70

Larger models show smaller performance gaps between biomedical and generalist versions.

Numbers: NEJM: OpenBioLLM-70B 74.1% vs Llama-3-70B-Instruct 74.6%
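
To make the size effect easier to see, the short sketch below computes the generalist-minus-biomedical gap in percentage points for the same-size pairs reported above. The scores are copied from the findings; the variable and function names are made up for this example, and the LongHealth pair is left out because it compares an 8B model with a 70B model.

```python
# Scores copied from the Key Findings: (task, generalist score, biomedical score).
# The LongHealth Task 3 pair is omitted because it mixes model sizes (8B vs 70B).
REPORTED = [
    ("JAMA accuracy, 70B", 65.0, 66.4),
    ("NEJM accuracy, 8B", 64.3, 30.0),
    ("MeDiSumCode valid code accuracy, 70B", 93.94, 73.65),
    ("NEJM accuracy, 70B", 74.6, 74.1),
]


def gap_points(generalist, biomedical):
    """Generalist minus biomedical, in percentage points (positive favors the generalist)."""
    return round(generalist - biomedical, 2)


GAPS = {task: gap_points(g, b) for task, g, b in REPORTED}

if __name__ == "__main__":
    for task, gap in GAPS.items():
        print(f"{task}: {gap:+.2f} points")
```

A positive value means the generalist model scored higher; a negative value means the biomedical model did.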

Results

Accuracy (JAMA)

Value: OpenBioLLM-70B 66.4%, Llama-3-70B-Instruct 65%

Baseline: Llama-3-70B-Instruct

Accuracy (NEJM)

Value: Llama-3-70B-Instruct 74.6%, OpenBioLLM-70B 74.1%

Baseline: Llama-3-70B-Instruct

Accuracy (MeDiSumCode valid codes)

Value: Llama-3-70B-Instruct 93.94%, OpenBioLLM-70B 73.65%

Baseline: Llama-3-70B-Instruct

LongHealth Task 3 (hallucination test; higher is better)

Value: Llama-3-70B-Instruct 91.70, OpenBioLLM-8B 1.55

Baseline: Llama-3-70B-Instruct

Accuracy

Value: OpenBioLLM-70B 80.85%, Llama-3-70B-Instruct 79.37%

Baseline: Llama-3-70B-Instruct

Who Should Care

What To Try In 7 Days

Benchmark your current biomedical model against a strong generalist (e.g., Llama-3-70B-Instruct) on 1–2 representative unseen clinical tasks.

Run a hallucination test (e.g., LongHealth Task 3) to gauge safety risk before any deployment.

Prototype a retrieval-augmented generation (RAG) pipeline for one summarization or coding task and compare outputs vs. your fine-tuned model.
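
The head-to-head comparison in the first step can be sketched as a small harness that scores any number of models on the same multiple-choice items. Everything here is an assumption for illustration: the item layout, the prompt wording, the simple letter parser, and the two stub "models", which are plain Python callables standing in for real inference calls.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical item layout: question text, lettered options, gold answer letter.
ITEMS = [
    {"question": "Toy question 1?", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "Toy question 2?", "options": {"A": "x", "B": "y"}, "answer": "B"},
]


def format_prompt(item: dict) -> str:
    """Render one item as a multiple-choice prompt."""
    options = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
    return f"{item['question']}\n{options}\nAnswer with a single letter."


def parse_letter(reply: str, valid: set) -> Optional[str]:
    """Return the first standalone letter in the reply that is a valid option."""
    for letter in re.findall(r"\b[A-Z]\b", reply.upper()):
        if letter in valid:
            return letter
    return None  # unparseable replies count as wrong


def accuracy(model: Callable[[str], str], items: List[dict]) -> float:
    """Fraction of items where the parsed reply equals the gold answer."""
    correct = 0
    for item in items:
        reply = model(format_prompt(item))
        if parse_letter(reply, set(item["options"])) == item["answer"]:
            correct += 1
    return correct / len(items)


def compare(models: Dict[str, Callable[[str], str]], items: List[dict]) -> Dict[str, float]:
    """Score every model on the same items."""
    return {name: accuracy(fn, items) for name, fn in models.items()}


# Stub models for demonstration only; replace with real inference calls.
def always_a(prompt: str) -> str:
    return "A"


def always_b(prompt: str) -> str:
    return "The answer is B."


if __name__ == "__main__":
    print(compare({"stub-a": always_a, "stub-b": always_b}, ITEMS))
```

Keeping the prompt and the parser identical across models means that differences in the scores reflect the models rather than the harness. The same loop could be reused for a hallucination check by adding items whose gold answer is an explicit "cannot be answered" option.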

Reproducibility

Data Urls

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks (NEJM, JAMA) are publicly available and may partially appear in model training data, possibly inflating generalist model performance.
  • Some evaluated biomedical models do not disclose training data, limiting causal claims about fine-tuning effects.
  • Benchmarks do not cover all real-world clinical complexity such as longitudinal patient histories or treatment planning.

When Not To Use

  • Do not assume small, publicly fine-tuned biomedical models are safer or more accurate for unseen clinical tasks.
  • Avoid deploying biomedical LLMs as sole clinical knowledge sources without hallucination checks and retrieval safeguards.

Failure Modes

  • Hallucinations on long or out-of-context documents, especially for some biomedical fine-tuned models.
  • Catastrophic forgetting: fine-tuning may reduce general knowledge and harm generalization.
  • Overfitting or data leakage if fine-tuning data overlaps evaluation sets.

Core Entities

Models

  • Llama-3-70B-Instruct
  • Llama-3-8B-Instruct
  • OpenBioLLM-70B
  • OpenBioLLM-8B
  • Mistral-7B-Instruct-v0.2
  • BioMistral-7B
  • SFT
  • MedAlpaca-7B
  • PMC-Llama-7B
  • Meditron-7B
  • Med42-70B
  • ClinicalCamel-70B

Metrics

  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BERT F1
  • F1 (EM/AP)
  • UMLS F1

Datasets

  • NEJM case challenges
  • JAMA case challenges
  • MeDiSumQA
  • MeDiSumCode
  • MedNLI
  • MeQSum
  • ProblemSummary
  • LongHealth
  • MIMIC-IV
  • MIMIC-III
  • CLUE benchmark

Benchmarks

  • CLUE
  • NEJM case challenges
  • JAMA case challenges
  • LongHealth

Context Entities

Models

  • Llama-2-7b-chat-hf
  • Llama-2-70b-chat-hf
  • Mistral-7B

Datasets

  • USMLE (mentioned as common benchmark)
  • MMLU (mentioned)