Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
5
Why It Matters For Business
Zero‑shot LLM extraction can unlock reasons for medication changes from notes quickly, lowering annotation costs and enabling equity and quality analyses across patient subgroups.
Summary TLDR
The authors used GPT‑4 (zero-shot, HIPAA‑compliant Azure API) on deidentified clinical notes from UCSF to extract which contraceptive was stopped, which was started, and the free-text reason for switching. They validated prompts on 93 manually annotated notes, then applied the best prompt to 1,964 switches from 1,515 patients. GPT‑4 achieved high correctness on extracted reasons (91.4% accuracy, 2.2% hallucination) and good extraction for 'started' contraceptives; it outperformed BOW/TF‑IDF, random forest, and a clinical BERT baseline. Topic clustering of extracted reasons found common causes (bleeding, patient preference, forgetting pills, adverse events, insurance) and linked insurance and
Problem Statement
Reasons for switching contraceptives are often written only in free-text clinical notes. Manually labeling notes or training custom models is slow. The paper asks: can a general LLM (GPT‑4) extract started/stopped contraceptives and the reason for switching without task-specific training, and can those extractions reveal subgroup differences?
Main Contribution
Demonstrate zero-shot GPT‑4 can extract contraceptive started/stopped and free-text reasons from clinical notes with high accuracy on a held-out, manually annotated set.
Provide a prompt-development and evaluation pipeline comparing six prompts and multiple baselines (logistic regression, random forest, UCSF‑BERT).
Cluster GPT‑4 extracted reasons with BERTopic to identify common reasons and show demographic enrichment (e.g., insurance, weight/mood concerns) in specific race/ethnicity groups.
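The zero-shot extraction step can be sketched as a prompt builder plus a defensive parser. The template wording, the JSON field names, and the function names below are illustrative assumptions, not the authors' prompts; the model call itself is omitted.

```python
import json

# Hypothetical prompt template (an assumption for illustration only).
PROMPT_TEMPLATE = (
    "You are reviewing a deidentified clinical note.\n"
    "Return a JSON object with the keys 'stopped', 'started', and 'reason'.\n"
    "Use null for any item the note does not state.\n\n"
    "Note:\n{note}\n"
)

FIELDS = ("stopped", "started", "reason")


def build_prompt(note: str) -> str:
    """Insert the note text into the zero-shot template."""
    return PROMPT_TEMPLATE.format(note=note.strip())


def parse_response(raw: str) -> dict:
    """Parse a model reply into the three fields.

    Malformed or non-object replies yield None for every field, so a bad
    reply is never mistaken for an extraction.
    """
    empty = {field: None for field in FIELDS}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {field: data.get(field) for field in FIELDS}
```

Returning None for unparseable replies keeps format failures separate from cases where the note simply does not state a reason.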
Key Findings
On manual review, GPT‑4 extracted the reason for switching with 91.4% accuracy and a 2.2% hallucination rate.
GPT‑4 extracted started and stopped contraceptives with high micro‑F1 on the development set, but performance on stopped contraceptives was lower at scale.
Zero‑shot GPT‑4 outperformed baselines trained on structured silver labels (logistic regression, random forest, UCSF‑BERT).
The analysis cohort comprised 1,964 contraceptive switches from 1,515 patients.
Some reasons for switching, such as insurance and weight or mood concerns, were enriched in specific race/ethnicity groups.
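One simple way to quantify subgroup differences of this kind is a rate ratio: how often a reason appears within a group, divided by how often it appears overall. This is a sketch under that assumption; the statistic the authors used is not stated here, and the group labels and records in the usage note are synthetic.

```python
from collections import Counter


def enrichment_ratios(records, reason):
    """Return group -> (rate of `reason` in group) / (overall rate of `reason`).

    `records` is an iterable of (group, reason_topic) pairs. A ratio above 1
    means the reason is more common in that group than in the cohort overall.
    """
    records = list(records)
    if not records:
        return {}
    overall = sum(1 for _, r in records if r == reason) / len(records)
    if overall == 0:
        return {}
    totals = Counter(group for group, _ in records)
    hits = Counter(group for group, r in records if r == reason)
    return {group: (hits[group] / totals[group]) / overall for group in totals}
```

For example, with the synthetic records `[("A", "insurance"), ("A", "bleeding"), ("B", "bleeding"), ("B", "bleeding")]`, the ratio for "insurance" is 2.0 in group A and 0.0 in group B. A ratio alone says nothing about statistical significance, so small groups need a proper test.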
Results
Contraceptive switches (cohort)
1,964 switches from 1,515 patients
GPT‑4 microF1 (development set)
GPT‑4 microF1 (test set)
Reason extraction correctness
91.4%
Hallucination rate
2.2%
Baseline classifier best microF1 (start)
Demographic differences (example)
Who Should Care
What To Try In 7 Days
Run the authors' prompt on a small sample of your deidentified notes and manually check 50 examples.
Compare GPT‑4 extractions to your structured medication records to find documentation gaps.
Cluster the extracted free‑text reasons (BERTopic or similar) to surface common operational issues like insurance gaps.
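The first two steps above can be sketched as a reproducible review sample plus a comparison against structured records. The encounter IDs, dictionary layout, and function names are hypothetical; real medication names would also need mapping to a shared vocabulary, which this sketch reduces to lowercasing.

```python
import random


def normalize(name):
    """Lowercase and trim a contraceptive name; empty or missing becomes None."""
    if name is None:
        return None
    cleaned = name.strip().lower()
    return cleaned or None


def sample_for_review(ids, k=50, seed=0):
    """Draw up to k IDs for manual checking, reproducibly."""
    ids = sorted(ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def find_documentation_gaps(extracted, structured):
    """Compare note-derived and structured values per encounter.

    Both arguments map encounter_id -> contraceptive name (or None).
    Returns (encounter_id, note_value, record_value) for each disagreement.
    """
    gaps = []
    for enc_id in sorted(set(extracted) | set(structured)):
        note_val = normalize(extracted.get(enc_id))
        rec_val = normalize(structured.get(enc_id))
        if note_val != rec_val:
            gaps.append((enc_id, note_val, rec_val))
    return gaps
```

A disagreement does not show which source is wrong; it only marks encounters worth a manual look.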
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single academic center dataset; results may not generalize to other hospitals or documentation styles.
- Clinical notes were deidentified and some brand names were incorrectly redacted, which can harm extraction accuracy.
- GPT‑4 is proprietary; training data and model internals are not public, limiting interpretability.
- Structured medication labels (silver labels) were noisy and disagreed with human-annotated notes, complicating baseline training.
When Not To Use
- When you require fully open‑source models for regulatory or reproducibility reasons.
- When clinical notes are extremely short or so heavily redacted that the reason for switching is absent.
- When you need provable causal inference rather than descriptive extraction of reasons.
Failure Modes
- Hallucination: the model can invent reasons not present in the note (2.2% observed in manual review).
- Missed mentions: stopped contraceptives may go undocumented, which lowers stop-extraction reliability on large, noisy datasets.
- Redaction errors: deidentification may remove brand names or key terms and reduce accuracy.
- Bias in documentation: clinical note language varies by clinician and institution, affecting topic clusters.
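A crude screen for the hallucination failure mode above is to check how many content words of an extracted reason actually occur in the source note. This is a heuristic sketch, not the manual review the authors performed; the stopword list, threshold, and function names are assumptions, and paraphrased but correct reasons will be flagged too.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "of", "to", "and", "or", "due", "for",
    "with", "in", "on", "her", "she", "was", "is",
}


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def grounding_score(reason, note):
    """Fraction of the reason's content words that also appear in the note."""
    words = content_words(reason)
    if not words:
        return 0.0
    return len(words & content_words(note)) / len(words)


def flag_for_review(reason, note, threshold=0.5):
    """True when too few of the reason's words are supported by the note."""
    return grounding_score(reason, note) < threshold
```

Flagged items are candidates for manual review, not confirmed hallucinations.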
Core Entities
Models
- GPT-4 (Microsoft Azure HIPAA instance)
- UCSF-BERT
- Random Forest
- Logistic Regression
Metrics
- MicroF1
- Accuracy
- Hallucination rate
- Cohen's Kappa
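Two of the metrics listed above, micro-F1 and Cohen's kappa, can be computed with the standard formulas in a few lines. The function names and the set-per-note input layout are assumptions; libraries such as scikit-learn provide equivalent implementations.

```python
from collections import Counter


def micro_f1(gold, pred):
    """Micro-averaged F1 over per-note label sets.

    `gold` and `pred` are equal-length lists of sets. Counts are pooled
    across all notes before computing F1 = 2TP / (2TP + FP + FN).
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Micro-averaging weights frequent contraceptives more heavily than rare ones, which is worth remembering when comparing start and stop performance.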
Datasets
- UCSF Information Commons deidentified clinical notes
Context Entities
Models
- GPT-4 (proprietary, training details not public)
Metrics
- MicroF1 for medication start/stop
- Topic enrichment scores
Datasets
- Deidentified clinician notes; not publicly available

