How LLMs are reshaping healthcare: capabilities, data needs, risks, and where to start

Overview

Decision Snapshot: Needs Validation

LLMs are ready for prototyping documentation, QA triage, and research support; they are not yet safe to use without human oversight for critical clinical decisions due to hallucination, bias, and privacy risks.

Citations: 28

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 30%

Authors

Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

Links

Abstract / PDF / Code

Why It Matters For Business

LLMs now match or approach clinician-level performance on some exam-style tasks and can speed documentation, triage, and literature review. However, risk, privacy, and integration costs mean businesses must plan governance and hybrid human+AI workflows.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, CEO, Founder

Summary TLDR

This is a broad, practice-oriented survey of large language models (LLMs) applied to healthcare. It maps what LLMs do better than older pretrained models (PLMs), shows common training recipes (instruction fine-tuning is dominant), summarizes datasets and compute needs, and flags the main barriers: fairness, accountability, transparency, privacy, and ethics. Benchmarks (USMLE, MedMCQA, PubMedQA) show specialized LLMs (Med-PaLM 2) and general LLMs (GPT-4) are near clinician-level on some exam-style tasks, but real-world deployment is limited by hallucination, bias, data access, integration, and regulatory gaps.

Problem Statement

Healthcare needs models that can read and reason across complex, multimodal medical data. PLMs worked well for narrow, labeled tasks but struggle with open-ended QA, multi-turn dialogue, and multimodal inputs. The paper surveys how LLMs close capability gaps and what technical, data, and ethical barriers remain for real-world use.

Main Contribution

Comprehensive survey of LLM capabilities, data, training methods, and task performance in healthcare.

Comparison between PLMs and LLMs across common clinical NLP tasks, highlighting when each is preferable.

Key Findings

Top LLMs approach human performance on exam-style medical questions.

Numbers: USMLE accuracy of GPT-4 86.7%, Med-PaLM 2 86.5%, human 87.0%

Practical Use: LLMs can be effective for knowledge retrieval and test-like QA; use them for decision support and education, not autonomous clinical decisions.

Evidence Ref: Table 4

Instruction fine-tuning (SFT) is the most common method to adapt LLMs for medicine.

Numbers: 21 published healthcare LLMs reported using SFT

Practical Use: If you need a medical LLM quickly, prioritize collecting good instruction/QA/dialogue pairs and SFT rather than training from scratch.

Evidence Ref: Table 3; Sec. 3.2
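
To make "instruction/QA/dialogue pairs" concrete, here is a minimal sketch of turning raw question-answer records into prompt/response lines for SFT. The field names (`instruction`, `input`, `output`), the prompt template, and the sample records are illustrative assumptions; the survey does not prescribe a data format.

```python
import json

# Hypothetical prompt template; the survey does not prescribe one.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def to_sft_example(record):
    """Turn one raw record into a prompt/response pair for SFT."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        input=record.get("input", "").strip(),
    )
    return {"prompt": prompt, "response": record["output"].strip()}

def to_jsonl(records):
    """Serialize records as JSON Lines, skipping ones with an empty answer."""
    lines = []
    for record in records:
        if not record.get("output", "").strip():
            continue  # an example without a target answer cannot be trained on
        lines.append(json.dumps(to_sft_example(record), ensure_ascii=False))
    return "\n".join(lines)

# Made-up records for illustration only; not real clinical data.
raw_records = [
    {"instruction": "Answer the patient question briefly.",
     "input": "What does 'hypertension' mean?",
     "output": "Hypertension means persistently high blood pressure."},
    {"instruction": "Answer the patient question briefly.",
     "input": "Placeholder question with no answer yet.",
     "output": ""},
]

jsonl_text = to_jsonl(raw_records)
print(jsonl_text)
```

Filtering out records with empty answers before training is a small design choice, but it is the kind of data-quality step that "good" pairs depend on.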

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 86.7%, Med‑PaLM 2 86.5%, Human 87.0%	FT BERT 44.62%	—	USMLE (exam-style)	Table 4: model vs human exam scores	Table 4
Accuracy	GPT-4 73.66%, Med‑PaLM 2 72.30%, Aloe-Alpha 64.47%	FT BERT 43.03%	—	MedMCQA (multiple-choice medical questions)	Table 4: benchmark scores reported by survey	Table 4
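
The Delta column above is left empty, but the gap to the baseline follows directly from the reported numbers. The sketch below only subtracts the FT BERT baseline from each score; all values are copied from the two rows above, and the dictionary layout is an assumption made for illustration.

```python
# Accuracy scores (percent) copied from the Results rows above.
results = {
    "USMLE": {
        "baseline": 44.62,  # FT BERT
        "models": {"GPT-4": 86.7, "Med-PaLM 2": 86.5, "Human": 87.0},
    },
    "MedMCQA": {
        "baseline": 43.03,  # FT BERT
        "models": {"GPT-4": 73.66, "Med-PaLM 2": 72.30, "Aloe-Alpha": 64.47},
    },
}

def deltas_vs_baseline(table):
    """Return percentage-point gains over the baseline, per dataset and model."""
    return {
        dataset: {
            name: round(score - row["baseline"], 2)
            for name, score in row["models"].items()
        }
        for dataset, row in table.items()
    }

deltas = deltas_vs_baseline(results)
for dataset, row in deltas.items():
    for name, delta in row.items():
        print(f"{dataset:8s} {name:11s} +{delta:.2f} points over FT BERT")
```

On USMLE, for example, GPT-4's gain over the fine-tuned BERT baseline works out to 42.08 percentage points.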

What To Try In 7 Days

Run a focused pilot: use an LLM (GPT-4 or an open LLaMA-based model) with retrieval (RAG) on 100 real questions from your domain to measure accuracy and hallucination rate.

Assemble 500–2,000 in-domain instruction/QA examples and fine-tune a small LLM (with LoRA or full-parameter SFT) to test performance improvement versus prompting alone.

Run a quick bias and privacy risk audit on your training/evaluation data and add simple mitigation (resampling / importance weighting) for underrepresented groups.
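
For the 100-question pilot, one way to report accuracy and hallucination rate is with a confidence interval, since 100 samples leave real uncertainty. This is a minimal sketch that assumes a reviewer has hand-labeled each answer; the field names `correct` and `hallucinated` and the synthetic labels are hypothetical, and the Wilson score interval is one reasonable choice among several.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n == 0:
        raise ValueError("need at least one labeled answer")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def score_pilot(labels):
    """Summarize reviewer labels into rates with confidence intervals."""
    n = len(labels)
    correct = sum(1 for x in labels if x["correct"])
    hallucinated = sum(1 for x in labels if x["hallucinated"])
    return {
        "n": n,
        "accuracy": correct / n,
        "accuracy_ci": wilson_interval(correct, n),
        "hallucination_rate": hallucinated / n,
        "hallucination_ci": wilson_interval(hallucinated, n),
    }

# Synthetic labels for illustration: 80 correct, 7 hallucinated, 13 other errors.
labels = (
    [{"correct": True, "hallucinated": False}] * 80
    + [{"correct": False, "hallucinated": True}] * 7
    + [{"correct": False, "hallucinated": False}] * 13
)

report = score_pilot(labels)
print(report)
```

With 100 questions, an observed accuracy of 80% carries an interval of roughly 71% to 87%, so small differences between models on a pilot of this size should be read cautiously.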

Optimization Features

Infra Optimization

Favor fine-tuning over training from scratch to cut GPU time.

Model Optimization

LoRA
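
LoRA freezes a pretrained weight matrix and trains only two small low-rank factors, which is where the savings come from. The arithmetic below is a sketch for a single weight matrix; the layer size (4096 x 4096) and rank (8) are illustrative assumptions, not values taken from the survey.

```python
def full_params(d_in, d_out):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Illustrative layer size and rank (assumed, not from the survey).
d_in, d_out, rank = 4096, 4096, 8

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full:,}  LoRA: {lora:,}  trainable fraction: {fraction:.4%}")
```

Under these assumptions, LoRA trains 65,536 parameters for the layer instead of 16,777,216, which is about 0.39% of the full matrix.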

Training Optimization

SFT

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/KaiHe-CatOwner/LLM-for-Healthcare

Risks & Boundaries

Limitations

Hallucinations: models can fabricate facts or citations.

Data privacy and leakage risks from EHR-trained models.

When Not To Use

Automated, unsupervised clinical diagnosis without clinician review.

When patient data privacy cannot be guaranteed.

Failure Modes

Confident but incorrect answers (hallucination).

Systematic underdiagnosis for underrepresented subgroups.
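
One simple way to surface subgroup underdiagnosis is to compare the false-negative rate across groups on a labeled evaluation set. The sketch below assumes each record carries a subgroup label, a ground-truth flag, and the model's prediction; the field names (`group`, `has_condition`, `predicted`) and the synthetic records are hypothetical.

```python
from collections import defaultdict

def false_negative_rate_by_group(records):
    """Share of truly positive cases the model missed, per subgroup."""
    positives = defaultdict(int)
    missed = defaultdict(int)
    for r in records:
        if r["has_condition"]:
            positives[r["group"]] += 1
            if not r["predicted"]:
                missed[r["group"]] += 1
    return {g: missed[g] / positives[g] for g in positives}

# Synthetic evaluation records for illustration only.
records = (
    [{"group": "A", "has_condition": True, "predicted": True}] * 45
    + [{"group": "A", "has_condition": True, "predicted": False}] * 5
    + [{"group": "B", "has_condition": True, "predicted": True}] * 6
    + [{"group": "B", "has_condition": True, "predicted": False}] * 4
    + [{"group": "A", "has_condition": False, "predicted": False}] * 20
)

fnr = false_negative_rate_by_group(records)
gap = max(fnr.values()) - min(fnr.values())
print(fnr, f"gap: {gap:.2f}")
```

In this synthetic example the smaller group B has only 10 positive cases, so its rate is also the less certain estimate; a large gap is a prompt for closer review rather than proof of bias.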

Core Entities

Models

Med-PaLM 2, GPT-4, Med-PaLM, Galactica, GatorTron, HuatuoGPT, MedAlpaca, ChatDoctor, LLaVA-Med, Visual Med-Alpaca

Metrics

Accuracy, macro-F1, human preference

Datasets

USMLE, MedMCQA, PubMedQA, MIMIC-III, MIMIC-CXR, PubMed, PMC, CheXpert

Benchmarks

MMLU-Medical, MultiMedQA, MedMCQA, PubMedQA, USMLE

Context Entities

Models

Galactica, PMC-LLaMA, GatorTronGPT, Aloe-Alpha, JMLR, Qilin-Med

Metrics

computational cost (GPU hours), dataset size (tokens/samples)

Datasets

MedDialog, COMETA (Reddit), PadChest, OpenPath, WorldMedQA-V

Benchmarks

VQA-RAD, Path-VQA

