Overview
Production Readiness
0.4
Novelty Score
0.3
Cost Impact Score
0.7
Citation Count
28
Why It Matters For Business
LLMs now match or approach clinician-level performance on some exam-style tasks and can speed up documentation, triage, and literature review. However, risk, privacy, and integration costs mean businesses must plan for governance and hybrid human+AI workflows.
Summary TLDR
This is a broad, practice-oriented survey of large language models (LLMs) applied to healthcare. It maps what LLMs do better than older pretrained language models (PLMs), describes common training recipes (instruction fine-tuning is dominant), summarizes datasets and compute needs, and flags the main barriers: fairness, accountability, transparency, privacy, and ethics. Benchmarks (USMLE, MedMCQA, PubMedQA) show that specialized LLMs (Med-PaLM 2) and general LLMs (GPT-4) are near clinician-level on some exam-style tasks, but real-world deployment is limited by hallucination, bias, data access, integration, and regulatory gaps.
Problem Statement
Healthcare needs models that can read and reason across complex, multimodal medical data. PLMs work well for narrow, labeled tasks but struggle with open-ended QA, multi-turn dialogue, and multimodal inputs. The paper surveys how LLMs close these capability gaps and what technical, data, and ethical barriers remain for real-world use.
Main Contribution
Comprehensive survey of LLM capabilities, data, training methods, and task performance in healthcare.
Comparison between PLMs and LLMs across common clinical NLP tasks, highlighting when each is preferable.
Practical discussion of non-technical barriers (fairness, accountability, transparency, and ethics) and a compiled list of public datasets and compute stats.
Key Findings
Top LLMs approach human performance on exam-style medical questions.
Instruction fine-tuning, a form of supervised fine-tuning (SFT), is the most common method to adapt LLMs for medicine.
Training LLMs from scratch is expensive and rare for medical models.
RLHF is used infrequently in medical LLMs due to cost and instability.
Fairness, accountability, transparency, and ethics are the largest non-technical barriers.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run a focused pilot: use an LLM (GPT-4 or an open LLaMA-based model) with retrieval (RAG) on 100 real questions from your domain to measure accuracy and hallucination rate.
Assemble 500–2,000 in-domain instruction/QA examples and fine-tune a small LLM (with LoRA adapters or full supervised fine-tuning) to measure the improvement over prompting alone.
Run a quick bias and privacy risk audit on your training/evaluation data and add simple mitigation (resampling / importance weighting) for underrepresented groups.
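The measurements in the first and third steps above can be sketched in a few lines of Python. This is a minimal sketch, not a validated evaluation harness. The `GradedAnswer` schema, its field names, and both helper functions are hypothetical, and it assumes a human reviewer has already marked each pilot answer as correct and/or hallucinated. The weighting shown is plain inverse-frequency weighting, which is one simple option among several.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One pilot question after human review (hypothetical schema)."""
    correct: bool        # reviewer judged the answer correct
    hallucinated: bool   # answer contained a fabricated fact or citation
    subgroup: str        # e.g. a demographic or site label from your data


def pilot_metrics(records):
    """Return accuracy and hallucination rate over the graded answers."""
    if not records:
        raise ValueError("no graded answers supplied")
    n = len(records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
    }


def importance_weights(records):
    """Inverse-frequency weight per subgroup.

    Weights are scaled so that they average to 1 across all records,
    which keeps the effective sample size comparable to the raw count.
    """
    if not records:
        raise ValueError("no graded answers supplied")
    counts = Counter(r.subgroup for r in records)
    n, k = len(records), len(counts)
    return {group: n / (k * count) for group, count in counts.items()}
```

Rarer subgroups receive larger weights, so the same weights could be passed to a weighted loss or used as sampling probabilities when resampling. Whether that is enough mitigation depends on the data and should be checked per subgroup.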
Optimization Features
Infra Optimization
- favor fine-tuning over training from scratch to cut GPU time
Model Optimization
- LoRA
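To make the appeal of LoRA concrete, the sketch below works through its parameter-count arithmetic for a single weight matrix. LoRA freezes the original `d_out x d_in` weight and trains two low-rank factors, `B` (`d_out x rank`) and `A` (`rank x d_in`). The function names are hypothetical, and the layer size and rank in the usage note are illustrative values, not figures from the survey.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    if rank <= 0 or rank > min(d_out, d_in):
        raise ValueError("rank must be between 1 and min(d_out, d_in)")
    return rank * (d_out + d_in)


def full_finetune_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def trainable_fraction(d_out, d_in, rank):
    """Share of the matrix's parameters that LoRA actually trains."""
    return lora_trainable_params(d_out, d_in, rank) / full_finetune_params(d_out, d_in)
```

For an assumed 4096 x 4096 projection at rank 8, `trainable_fraction(4096, 4096, 8)` evaluates to 1/256. The saving in optimizer state and gradient memory is large, but wall-clock GPU time does not shrink by the same factor, because the forward pass still runs through the full frozen model.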
Training Optimization
- SFT
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Hallucinations: models can fabricate facts or citations.
- Data privacy and leakage risks from EHR-trained models.
- Benchmark gaps: exam-style tests do not equal clinical safety.
- Prompt brittleness: outputs change with small prompt edits.
- Integration friction with hospital IT and varied data standards.
When Not To Use
- Automated, unsupervised clinical diagnosis without clinician review.
- When patient data privacy cannot be guaranteed.
- In fast emergency decisions where latency and interpretability matter.
Failure Modes
- Confident but incorrect answers (hallucination).
- Systematic underdiagnosis for underrepresented subgroups.
- Catastrophic forgetting when applying unstable RLHF updates.
- Performance shifts due to prompt format or data distribution change.
Core Entities
Models
- Med-PaLM 2
- GPT-4
- Med-PaLM
- Galactica
- GatorTron
- HuatuoGPT
- MedAlpaca
- ChatDoctor
- LLaVA-Med
- Visual Med-Alpaca
Metrics
- Accuracy
- macro-F1
- human preference
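For readers comparing the first two metrics, here is a minimal pure-Python sketch of accuracy and macro-F1 on a multi-class QA task. The function names are hypothetical. The yes/no/maybe labels in the usage note mirror the PubMedQA answer format, but the example data is invented for illustration. A maintained library implementation would normally be preferable in practice.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances cannot occur here,
        # but guard against division by zero anyway.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With references `["yes", "yes", "no", "maybe"]` and predictions `["yes", "no", "no", "no"]`, accuracy is 0.5 while macro-F1 is about 0.39, because the never-predicted "maybe" class contributes an F1 of zero. This is why macro-F1 is the more informative of the two when classes are imbalanced.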
Datasets
- USMLE
- MedMCQA
- PubMedQA
- MIMIC-III
- MIMIC-CXR
- PubMed
- PMC
- CheXpert
Benchmarks
- MMLU-Medical
- MultiMedQA
- MedMCQA
- PubMedQA
- USMLE
Context Entities
Models
- Galactica
- PMC-LLaMA
- GatorTronGPT
- Aloe-Alpha
- JMLR
- Qilin-Med
Metrics
- computational cost (GPU hours)
- dataset size (tokens/samples)
Datasets
- MedDialog
- COMETA (Reddit)
- PadChest
- OpenPath
- WorldMedQA-V
Benchmarks
- VQA-RAD
- Path-VQA

