How LLMs are reshaping healthcare: capabilities, data needs, risks, and where to start

October 9, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

LLMs are ready for prototyping documentation, QA triage, and research support; they are not yet safe to use without human oversight for critical clinical decisions due to hallucination, bias, and privacy risks.

Citations: 28

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 30%

Authors

Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

Why It Matters For Business

LLMs now match or approach clinician-level performance on some exam-style tasks and can speed up documentation, triage, and literature review. However, risk, privacy, and integration costs mean businesses must plan for governance and hybrid human+AI workflows.

Who Should Care

Summary TLDR

This is a broad, practice-oriented survey of large language models (LLMs) applied to healthcare. It maps what LLMs do better than older pretrained language models (PLMs), shows common training recipes (instruction fine-tuning is dominant), summarizes datasets and compute needs, and flags the main barriers: fairness, accountability, transparency, privacy, and ethics. Benchmarks (USMLE, MedMCQA, PubMedQA) show that specialized LLMs (Med-PaLM 2) and general LLMs (GPT-4) are near clinician level on some exam-style tasks, but real-world deployment is limited by hallucination, bias, data access, integration, and regulatory gaps.

Problem Statement

Healthcare needs models that can read and reason across complex, multimodal medical data. PLMs worked well for narrow, labeled tasks but struggle with open-ended QA, multi-turn dialogue, and multimodal inputs. The paper surveys how LLMs close capability gaps and what technical, data, and ethical barriers remain for real-world use.

Main Contribution

Comprehensive survey of LLM capabilities, data, training methods, and task performance in healthcare.

Comparison between PLMs and LLMs across common clinical NLP tasks, highlighting when each is preferable.

Key Findings

Top LLMs approach human performance on exam-style medical questions.

Numbers: USMLE: GPT-4 86.7%, Med-PaLM 2 86.5%, Human 87.0%

Practical Use: LLMs can be effective for knowledge retrieval and test-like QA; use them for decision support and education, not autonomous clinical decisions.

Evidence Ref: Table 4

Instruction fine-tuning (supervised fine-tuning, SFT) is the most common method to adapt LLMs for medicine.

Numbers: 21 published healthcare LLMs reported using SFT

Practical Use: If you need a medical LLM quickly, prioritize collecting good instruction/QA/dialogue pairs and SFT rather than training from scratch.

Evidence Ref: Table 3; Sec. 3.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4 86.7%, Med-PaLM 2 86.5%, Human 87.0% | FT BERT 44.62% | not reported | USMLE (exam-style) | Table 4: model vs human exam scores | Table 4
Accuracy | GPT-4 73.66%, Med-PaLM 2 72.30%, Aloe-Alpha 64.47% | FT BERT 43.03% | not reported | MedMCQA (multiple-choice medical questions) | Table 4: benchmark scores reported by survey | Table 4
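
The Results rows give a fine-tuned BERT baseline but no explicit delta. As a minimal sketch, the gap can be derived directly from the reported figures. The numbers below are copied from the rows above; the function name and dictionary layout are illustrative choices, not part of the survey.

```python
# Accuracy figures (percent) copied from the Results rows above (survey Table 4).
REPORTED = {
    "USMLE": {"GPT-4": 86.7, "Med-PaLM 2": 86.5, "Human": 87.0},
    "MedMCQA": {"GPT-4": 73.66, "Med-PaLM 2": 72.30, "Aloe-Alpha": 64.47},
}
# Fine-tuned BERT baseline for each dataset, also from the Results rows.
BASELINE = {"USMLE": 44.62, "MedMCQA": 43.03}


def deltas_vs_baseline(reported, baseline):
    """Return the gain over the baseline, in percentage points, per dataset and system."""
    return {
        dataset: {
            name: round(score - baseline[dataset], 2)
            for name, score in scores.items()
        }
        for dataset, scores in reported.items()
    }


if __name__ == "__main__":
    for dataset, gains in deltas_vs_baseline(REPORTED, BASELINE).items():
        for name, gain in gains.items():
            print(f"{dataset:8s} {name:11s} +{gain:.2f} pp")
```

On these figures, GPT-4 sits 42.08 percentage points above the fine-tuned BERT baseline on USMLE and 30.63 points above it on MedMCQA.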

What To Try In 7 Days

Run a focused pilot: use an LLM (GPT-4 or an open LLaMA-based model) with retrieval (RAG) on 100 real questions from your domain to measure accuracy and hallucination rate.

Assemble 500–2,000 in-domain instruction/QA examples and fine-tune a small LLM (full SFT, or parameter-efficient SFT with LoRA) to test the performance improvement versus prompting alone.

Run a quick bias and privacy risk audit on your training/evaluation data and add simple mitigation (resampling / importance weighting) for underrepresented groups.
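
The importance weighting mentioned in the audit step can be sketched as inverse-frequency weights. This is one common heuristic, not a method prescribed by the survey; the group labels `"A"` and `"B"` and the function name are hypothetical.

```python
from collections import Counter


def importance_weights(groups):
    """Weight each example by N / (K * n_g), where N is the number of examples,
    K the number of distinct groups, and n_g the size of the example's group.
    Every group then carries the same total weight, N / K."""
    if not groups:
        raise ValueError("groups must not be empty")
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]


if __name__ == "__main__":
    # Hypothetical audit sample: group "B" is underrepresented 8 to 2.
    labels = ["A"] * 8 + ["B"] * 2
    weights = importance_weights(labels)
    print(weights[0], weights[-1])  # 0.625 for "A" examples, 2.5 for "B" examples
```

The weights can be passed as per-example loss weights during fine-tuning or used as sampling probabilities when resampling. Reweighting only rebalances what is already in the data, so a group with very few examples still needs more data collection.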

Optimization Features

Infra Optimization: favor fine-tuning over training from scratch to cut GPU time
Model Optimization: LoRA
Training Optimization: SFT
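
To illustrate why LoRA reduces fine-tuning cost: it freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in, so the trainable parameter count for that matrix drops from d_out · d_in to r · (d_out + d_in). The sketch below works through the arithmetic. The 4096 × 4096 layer size and rank 8 are assumed values for illustration, not figures from the survey.

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Assumed dimensions for one projection matrix; adjust to your model.
    d_out, d_in, rank = 4096, 4096, 8
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full fine-tuning: {full:,} parameters")   # 16,777,216
    print(f"LoRA (r={rank}):     {lora:,} parameters")  # 65,536
    print(f"trainable share:  {lora / full:.2%}")      # 0.39%
```

The saving applies to trainable parameters, gradients, and optimizer state; the frozen base weights still have to be held in memory for the forward pass.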

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Hallucinations: models can fabricate facts or citations.

Data privacy and leakage risks from EHR-trained models.

When Not To Use

Automated, unsupervised clinical diagnosis without clinician review.

When patient data privacy cannot be guaranteed.

Failure Modes

Confident but incorrect answers (hallucination).

Systematic underdiagnosis for underrepresented subgroups.
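
Both failure modes can be tracked in a pilot if each model answer is graded by a clinician reviewer. The following is a minimal sketch of such a tally, assuming a hypothetical `Graded` record with reviewer judgments and a subgroup label. The field names and the sample data are illustrative, not taken from the survey.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Graded:
    correct: bool        # reviewer judged the answer clinically correct
    hallucinated: bool   # reviewer flagged a fabricated fact or citation
    subgroup: str        # hypothetical patient-subgroup label for the question


def summarize(results):
    """Overall accuracy, hallucination rate, and accuracy split by subgroup."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    by_group = defaultdict(list)
    for r in results:
        by_group[r.subgroup].append(r.correct)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "accuracy_by_subgroup": {g: sum(v) / len(v) for g, v in by_group.items()},
    }


if __name__ == "__main__":
    # Made-up grades for four answers, only to show the output shape.
    sample = [
        Graded(True, False, "A"),
        Graded(True, False, "A"),
        Graded(False, True, "A"),
        Graded(False, False, "B"),
    ]
    print(summarize(sample))
```

A large gap between subgroup accuracies is a prompt for closer review rather than proof of bias, since small subgroups produce noisy estimates.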

Core Entities

Models

Med-PaLM 2, GPT-4, Med-PaLM, Galactica, GatorTron, HuatuoGPT, MedAlpaca, ChatDoctor, LLaVA-Med, Visual Med-Alpaca

Metrics

Accuracy, macro-F1, human preference

Datasets

USMLE, MedMCQA, PubMedQA, MIMIC-III, MIMIC-CXR, PubMed, PMC, CheXpert

Benchmarks

MMLU-Medical, MultiMedQA, MedMCQA, PubMedQA, USMLE

Context Entities

Models

Galactica, PMC-LLaMA, GatorTronGPT, Aloe-Alpha, JMLR, Qilin-Med

Metrics

computational cost (GPU hours), dataset size (tokens/samples)

Datasets

MedDialog, COMETA (Reddit), PadChest, OpenPath, WorldMedQA-V

Benchmarks

VQA-RAD, Path-VQA