Overview
LLMs are ready for prototyping in documentation, QA triage, and research support; hallucination, bias, and privacy risks make them unsafe for critical clinical decisions without human oversight.
Citations: 28
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
LLMs now match or approach clinician-level performance on some exam-style tasks and can speed up documentation, triage, and literature review. However, risk, privacy, and integration costs mean businesses must plan for governance and hybrid human+AI workflows.
Summary TLDR
This is a broad, practice-oriented survey of large language models (LLMs) applied to healthcare. It maps what LLMs do better than older pretrained language models (PLMs), describes common training recipes (instruction fine-tuning is dominant), summarizes datasets and compute needs, and flags the main barriers: fairness, accountability, transparency, privacy, and ethics. Benchmarks (USMLE, MedMCQA, PubMedQA) show that specialized LLMs (Med-PaLM 2) and general LLMs (GPT-4) are near clinician level on some exam-style tasks, but real-world deployment is limited by hallucination, bias, data access, integration, and regulatory gaps.
Problem Statement
Healthcare needs models that can read and reason across complex, multimodal medical data. PLMs worked well for narrow, labeled tasks but struggle with open-ended QA, multi-turn dialogue, and multimodal inputs. The paper surveys how LLMs close capability gaps and what technical, data, and ethical barriers remain for real-world use.
Main Contribution
Comprehensive survey of LLM capabilities, data, training methods, and task performance in healthcare.
Comparison between PLMs and LLMs across common clinical NLP tasks, highlighting when each is preferable.
Key Findings
Top LLMs approach human performance on exam-style medical questions.
Instruction fine-tuning, a form of supervised fine-tuning (SFT), is the most common method to adapt LLMs for medicine.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 86.7%, Med‑PaLM 2 86.5%, Human 87.0% | FT BERT 44.62% | — | USMLE (exam-style) | Table 4: model vs human exam scores | Table 4 |
| Accuracy | GPT-4 73.66%, Med‑PaLM 2 72.30%, Aloe-Alpha 64.47% | FT BERT 43.03% | — | MedMCQA (multiple-choice medical questions) | Table 4: benchmark scores reported by survey | Table 4 |
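The Delta column above is empty, but the gap to the fine-tuned BERT baseline follows directly from the reported scores. The sketch below does that arithmetic in percentage points; the dictionary layout and the `deltas` helper are illustrative assumptions, and only the numbers come from the table.

```python
# Accuracy scores (%) copied from the Results table above.
# The dictionary layout and function name are hypothetical, for illustration only.
RESULTS = {
    "USMLE": {
        "baseline": 44.62,  # FT BERT
        "scores": {"GPT-4": 86.7, "Med-PaLM 2": 86.5, "Human": 87.0},
    },
    "MedMCQA": {
        "baseline": 43.03,  # FT BERT
        "scores": {"GPT-4": 73.66, "Med-PaLM 2": 72.30, "Aloe-Alpha": 64.47},
    },
}


def deltas(results):
    """Return score minus baseline, in percentage points, per dataset and system."""
    return {
        dataset: {
            name: round(score - entry["baseline"], 2)
            for name, score in entry["scores"].items()
        }
        for dataset, entry in results.items()
    }


if __name__ == "__main__":
    for dataset, by_system in deltas(RESULTS).items():
        for name, delta in by_system.items():
            print(f"{dataset:8s} {name:12s} +{delta:.2f} pp vs FT BERT")
```

These are absolute differences in accuracy, not relative improvements, and they compare systems that may have been evaluated under different prompting setups.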
What To Try In 7 Days
Run a focused pilot: use an LLM (GPT-4 or an open LLaMA-based model) with retrieval (RAG) on 100 real questions from your domain to measure accuracy and hallucination rate.
Assemble 500–2,000 in-domain instruction/QA examples and fine-tune a small LLM (with LoRA or full supervised fine-tuning) to test the performance improvement over prompting alone.
Run a quick bias and privacy risk audit on your training/evaluation data and add simple mitigation (resampling / importance weighting) for underrepresented groups.
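The first and third steps above reduce to simple bookkeeping once a human reviewer has labeled each answer. The sketch below is one possible way to do it, not a method from the survey. It assumes "hallucination rate" means the fraction of answers in which a reviewer found a fabricated fact or citation, and that importance weighting means inverse-frequency weights per subgroup. The `PilotResult` record and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PilotResult:
    """One human-reviewed answer from the pilot (hypothetical record layout)."""
    question_id: str
    correct: bool        # reviewer judged the answer correct
    hallucinated: bool   # reviewer found a fabricated fact or citation


def score_pilot(results):
    """Return (accuracy, hallucination_rate) as fractions in [0, 1]."""
    if not results:
        raise ValueError("no reviewed answers to score")
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    hallucination_rate = sum(r.hallucinated for r in results) / n
    return accuracy, hallucination_rate


def inverse_frequency_weights(groups):
    """Per-example weights that give every subgroup the same total weight.

    weight = n / (k * count[group]), where n is the number of examples and
    k is the number of distinct groups, so the weights average to 1.
    """
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]


if __name__ == "__main__":
    reviewed = [
        PilotResult("q1", correct=True, hallucinated=False),
        PilotResult("q2", correct=False, hallucinated=True),
        PilotResult("q3", correct=True, hallucinated=False),
        PilotResult("q4", correct=True, hallucinated=False),
    ]
    print(score_pilot(reviewed))
    print(inverse_frequency_weights(["a", "a", "a", "b"]))
```

With only 100 questions the estimates are coarse, so it helps to report the raw counts alongside the rates. The weights can be passed as per-example sample weights to a training loss or used as resampling probabilities after normalization.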
Risks & Boundaries
Limitations
Hallucinations: models can fabricate facts or citations.
Data privacy and leakage risks from EHR-trained models.
When Not To Use
Fully automated clinical diagnosis without clinician review.
When patient data privacy cannot be guaranteed.
Failure Modes
Confident but incorrect answers (hallucination).
Systematic underdiagnosis for underrepresented subgroups.
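One way to look for the underdiagnosis failure mode above is to compare miss rates (false-negative rates) across subgroups on a labeled evaluation set. The sketch below assumes each record is a `(group, has_condition, model_flagged)` tuple; that layout and the `miss_rate_by_group` name are hypothetical and not taken from the survey.

```python
from collections import defaultdict


def miss_rate_by_group(records):
    """Compute the false-negative rate per subgroup.

    records: iterable of (group, has_condition, model_flagged) tuples, where
    has_condition is the ground-truth label and model_flagged is the model's
    decision. Groups with no true-positive cases are omitted from the result.
    """
    positives = defaultdict(int)  # cases that truly have the condition
    misses = defaultdict(int)     # of those, cases the model failed to flag
    for group, has_condition, model_flagged in records:
        if has_condition:
            positives[group] += 1
            if not model_flagged:
                misses[group] += 1
    return {group: misses[group] / positives[group] for group in positives}


if __name__ == "__main__":
    sample = [
        ("A", True, True), ("A", True, True), ("A", True, False), ("A", True, True),
        ("B", True, False), ("B", True, True), ("B", False, False),
    ]
    for group, rate in miss_rate_by_group(sample).items():
        print(f"group {group}: miss rate {rate:.2f}")
```

A large gap between groups is a signal to investigate rather than proof of bias, since small subgroup sizes make these rates noisy.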

