How LLMs are reshaping healthcare: capabilities, data needs, risks, and where to start

October 9, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.7

Citation Count

28

Authors

Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

Links

Abstract / PDF

Why It Matters For Business

LLMs now match or approach clinician-level performance on some exam-style tasks and can speed up documentation, triage, and literature review. However, risk, privacy, and integration costs mean businesses must plan for governance and hybrid human+AI workflows.

Summary TLDR

This is a broad, practice-oriented survey of large language models (LLMs) applied to healthcare. It maps what LLMs do better than older pretrained language models (PLMs), shows common training recipes (instruction fine-tuning is dominant), summarizes datasets and compute needs, and flags the main barriers: fairness, accountability, transparency, privacy, and ethics. Benchmarks (USMLE, MedMCQA, PubMedQA) show that specialized LLMs (Med-PaLM 2) and general LLMs (GPT-4) are near clinician-level on some exam-style tasks, but real-world deployment is limited by hallucination, bias, data access, integration, and regulatory gaps.

Problem Statement

Healthcare needs models that can read and reason across complex, multimodal medical data. PLMs worked well for narrow, labeled tasks but struggle with open-ended QA, multi-turn dialogue, and multimodal inputs. The paper surveys how LLMs close capability gaps and what technical, data, and ethical barriers remain for real-world use.

Main Contribution

Comprehensive survey of LLM capabilities, data, training methods, and task performance in healthcare.

Comparison between PLMs and LLMs across common clinical NLP tasks, highlighting when each is preferable.

Practical discussion of non-technical barriers (fairness, accountability, transparency, and ethics), plus a compiled list of public datasets and compute stats.

Key Findings

Top LLMs approach human performance on exam-style medical questions.

Numbers: USMLE: GPT-4 86.7%, Med-PaLM 2 86.5%, Human 87.0%

Instruction fine-tuning, a form of supervised fine-tuning (SFT), is the most common method to adapt LLMs for medicine.

Numbers: 21 published healthcare LLMs reported using SFT

Training LLMs from scratch is expensive and rare for medical models.

Numbers: GatorTron: ~992 A100 GPUs for 6 days; other pretraining-from-scratch efforts are limited
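To put the GatorTron figure in perspective, it can be converted to GPU-hours. The sketch below is only back-of-the-envelope arithmetic: it assumes all 992 GPUs ran around the clock for the full 6 days, which the summary does not state, and the function name `gpu_hours` is illustrative.

```python
# Back-of-the-envelope GPU-hours for the GatorTron figure quoted above.
# Assumption (not from the survey): every GPU is busy 24 hours a day
# for the entire run.

def gpu_hours(num_gpus: int, days: float, hours_per_day: float = 24.0) -> float:
    """Return total GPU-hours for a run under the full-utilization assumption."""
    return num_gpus * days * hours_per_day


gatortron_gpu_hours = gpu_hours(num_gpus=992, days=6)
print(f"GatorTron (approx.): {gatortron_gpu_hours:,.0f} A100 GPU-hours")
# prints "GatorTron (approx.): 142,848 A100 GPU-hours"
```

Multiplying the result by your own cloud or on-premises hourly rate gives a rough budget ceiling to compare against a fine-tuning run.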

RLHF is used infrequently in medical LLMs due to cost and instability.

Numbers: Only a few models (e.g., HuatuoGPT, MedAlpaca variants) report RLHF use

Fairness, accountability, transparency, and ethics are the largest non-technical barriers.

Numbers: Repeated emphasis across multiple sections; dataset bias and privacy risk examples cited

Results

Accuracy

Value: GPT-4 86.7%, Med-PaLM 2 86.5%, Human 87.0%

Baseline: FT BERT 44.62%

Accuracy

Value: GPT-4 73.66%, Med-PaLM 2 72.30%, Aloe-Alpha 64.47%

Baseline: FT BERT 43.03%

Accuracy

Value: Aloe-Alpha 80.2%, Med-PaLM 2 81.8%, GPT-4 80.4%

Baseline: FT BERT 72.20%
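The size of the gap over the fine-tuned BERT baseline differs a lot between the three results blocks above, and it can be computed directly from the listed numbers. This is a small illustrative calculation: the blocks are labeled by their order on this page because the page does not name the benchmark for each one, and only the best-scoring model per block is used.

```python
# Absolute gain (percentage points) of the best listed LLM over the
# fine-tuned BERT baseline, per results block. All scores are copied from
# the Results section; the block labels are positional placeholders.
results = [
    # (block label, best model, best accuracy %, FT BERT baseline %)
    ("block 1", "GPT-4", 86.7, 44.62),
    ("block 2", "GPT-4", 73.66, 43.03),
    ("block 3", "Med-PaLM 2", 81.8, 72.20),
]

gains = {}
for label, model, best, baseline in results:
    # Round to 2 decimals to hide floating-point noise.
    gains[label] = round(best - baseline, 2)
    print(f"{label}: {model} {best}% vs FT BERT {baseline}% -> +{gains[label]} points")
```

The gain is about 42 points in the first block and under 10 points in the third, which fits the survey's point that a fine-tuned PLM remains competitive on some tasks.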

Who Should Care

What To Try In 7 Days

Run a focused pilot: use an LLM (GPT-4 or an open LLaMA-based model) with retrieval-augmented generation (RAG) on 100 real questions from your domain to measure accuracy and hallucination rate.
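Scoring such a pilot can be as simple as the sketch below. It assumes you have already collected the model's answers and that a human reviewer has flagged each answer that contains a claim not supported by the retrieved sources. The `PilotRecord` fields, the exact-match rule, and the definition of hallucination rate as the fraction of flagged answers are illustrative assumptions, not definitions from the survey.

```python
from dataclasses import dataclass


@dataclass
class PilotRecord:
    question: str
    model_answer: str
    gold_answer: str
    # Hypothetical reviewer label: True if the answer contains at least one
    # claim that the retrieved documents do not support.
    flagged_unsupported: bool


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def score_pilot(records: list[PilotRecord]) -> dict[str, float]:
    """Return exact-match accuracy and the fraction of reviewer-flagged answers."""
    if not records:
        raise ValueError("no records to score")
    n = len(records)
    correct = sum(normalize(r.model_answer) == normalize(r.gold_answer) for r in records)
    flagged = sum(r.flagged_unsupported for r in records)
    return {"accuracy": correct / n, "hallucination_rate": flagged / n}


# Toy data for illustration only.
demo = [
    PilotRecord("Q1", "Metformin", "metformin", False),
    PilotRecord("Q2", "B", "B", False),
    PilotRecord("Q3", "Option C", "Option A", True),
    PilotRecord("Q4", "yes", "Yes", False),
]
print(score_pilot(demo))  # {'accuracy': 0.75, 'hallucination_rate': 0.25}
```

Exact match only suits multiple-choice or short-answer questions; free-text answers would need clinician grading in place of the string comparison.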

Assemble 500–2,000 in-domain instruction/QA examples and fine-tune (LoRA or SFT) a small LLM to test performance improvement versus prompts only.
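Part of why LoRA suits a one-week experiment is that it trains far fewer parameters than full fine-tuning. LoRA freezes a weight matrix of shape `d_out x d_in` and learns two low-rank factors of shapes `d_out x r` and `r x d_in`. The sketch below only counts parameters; the layer size 4096 and rank 8 are hypothetical example values, not figures from the survey.

```python
# Trainable-parameter count for one linear layer: full fine-tuning vs LoRA.
# LoRA learns B (d_out x r) and A (r x d_in) while the original
# d_out x d_in weight matrix stays frozen.

def full_finetune_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank LoRA factors."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen for illustration.
D_IN, D_OUT, RANK = 4096, 4096, 8

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={RANK}):     {lora:,} trainable parameters")
print(f"LoRA share:       {100 * lora / full:.2f}% of the full layer")
```

With these example values LoRA trains 65,536 parameters for that layer instead of 16,777,216, which is why it usually fits on much smaller hardware.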

Run a quick bias and privacy risk audit on your training/evaluation data and add simple mitigation (resampling / importance weighting) for underrepresented groups.
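One simple form of the importance weighting mentioned above is inverse-frequency weighting: each example gets a weight inversely proportional to the size of its subgroup, so every subgroup contributes equally to the total. The sketch below assumes each example already carries a subgroup label; the group names and counts are made up for illustration.

```python
from collections import Counter


def inverse_frequency_weights(group_labels: list[str]) -> dict[str, float]:
    """Per-group weight n_total / (n_groups * n_group).

    With these weights every group contributes the same total weight,
    and the weights over all examples still sum to n_total.
    """
    counts = Counter(group_labels)
    n_total = len(group_labels)
    n_groups = len(counts)
    return {group: n_total / (n_groups * count) for group, count in counts.items()}


# Hypothetical audit sample: 80 examples from group "A", 20 from group "B".
labels = ["A"] * 80 + ["B"] * 20
weights = inverse_frequency_weights(labels)
print(weights)  # {'A': 0.625, 'B': 2.5}

# Per-example weights, ready to pass to a weighted loss or a weighted sampler.
sample_weights = [weights[g] for g in labels]
```

The same per-group weights can drive resampling instead: drawing examples with probability proportional to `sample_weights` oversamples the smaller group.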

Optimization Features

Infra Optimization

  • favor fine-tuning over training from scratch to cut GPU time

Model Optimization

  • LoRA

Training Optimization

  • SFT

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Hallucinations: models can fabricate facts or citations.
  • Data privacy and leakage risks from EHR-trained models.
  • Benchmark gaps: exam-style tests do not equal clinical safety.
  • Prompt brittleness: outputs change with small prompt edits.
  • Integration friction with hospital IT and varied data standards.

When Not To Use

  • Automated, unsupervised clinical diagnosis without clinician review.
  • When patient data privacy cannot be guaranteed.
  • In fast emergency decisions where latency and interpretability matter.

Failure Modes

  • Confident but incorrect answers (hallucination).
  • Systematic underdiagnosis for underrepresented subgroups.
  • Catastrophic forgetting when applying unstable RLHF updates.
  • Performance shifts due to prompt format or data distribution change.

Core Entities

Models

  • Med-PaLM 2
  • GPT-4
  • Med-PaLM
  • Galactica
  • GatorTron
  • HuatuoGPT
  • MedAlpaca
  • ChatDoctor
  • LLaVA-Med
  • Visual Med-Alpaca

Metrics

  • Accuracy
  • macro-F1
  • human preference

Datasets

  • USMLE
  • MedMCQA
  • PubMedQA
  • MIMIC-III
  • MIMIC-CXR
  • PubMed
  • PMC
  • CheXpert

Benchmarks

  • MMLU-Medical
  • MultiMedQA
  • MedMCQA
  • PubMedQA
  • USMLE

Context Entities

Models

  • Galactica
  • PMC-LLaMA
  • GatorTronGPT
  • Aloe-Alpha
  • JMLR
  • Qilin-Med

Metrics

  • computational cost (GPU hours)
  • dataset size (tokens/samples)

Datasets

  • MedDialog
  • COMETA (Reddit)
  • PadChest
  • OpenPath
  • WorldMedQA-V

Benchmarks

  • VQA-RAD
  • Path-VQA