Overview
The approach is practical and outperforms baselines on this dataset, but it relies on proprietary GPT‑4, a single‑center cohort, and manual validation to catch hallucinations.
Citations: 5
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/7
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Zero‑shot LLM extraction can unlock reasons for medication changes from notes quickly, lowering annotation costs and enabling equity and quality analyses across patient subgroups.
Who Should Care
Summary TLDR
The authors used GPT‑4 (zero-shot, HIPAA‑compliant Azure API) on deidentified clinical notes from UCSF to extract which contraceptive was stopped, which was started, and the free-text reason for switching. They validated prompts on 93 manually annotated notes, then applied the best prompt to 1,964 switches from 1,515 patients. GPT‑4 achieved high correctness on extracted reasons (91.4% accuracy, 2.2% hallucination) and good extraction for 'started' contraceptives; it outperformed BOW/TF‑IDF, random forest, and a clinical BERT baseline. Topic clustering of extracted reasons found common causes (bleeding, patient preference, forgetting pills, adverse events, insurance) and linked insurance and
Problem Statement
Reasons for switching contraceptives are often written only in free-text clinical notes. Manually labeling notes or training custom models is slow. The paper asks: can a general LLM (GPT‑4) extract started/stopped contraceptives and the reason for switching without task-specific training, and can those extractions reveal subgroup differences?
Main Contribution
Demonstrate that zero-shot GPT‑4 can extract the contraceptives started and stopped, along with free-text reasons for switching, from clinical notes with high accuracy on a held-out, manually annotated set.
Provide a prompt-development and evaluation pipeline comparing six prompts and multiple baselines (logistic regression, random forest, UCSF‑BERT).
Key Findings
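GPT‑4 extracted reasons for switching correctly in 91.4% of manually reviewed cases, with a 2.2% hallucination rate.
GPT‑4 extracted started and stopped contraceptives with high micro‑F1 on the development set (started 0.849, stopped 0.881), but extraction of stopped contraceptives was less reliable at scale.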
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Contraceptive switches (cohort) | 1,964 switches from 1,515 patients | — | — | UCSF cohort (2012–2023) | Results: cohort selection and filtering; Figure 1 | Results |
| GPT‑4 micro‑F1 (development set) | Started 0.849; Stopped 0.881 | — | — | Held-out 5% manually annotated notes (n=93) | Prompt evaluation (Figure 2; Results) | Results |
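
For readers unfamiliar with the metric, the following is a minimal sketch of micro-averaged F1 over per-note label sets. It assumes each note is scored as a set of contraceptive labels; the summary does not state exactly how the authors computed their scores, and the function name and toy labels below are illustrative only.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1: pool true/false positives and false negatives
    across all notes before computing precision and recall."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels extracted and annotated
        fp += len(p - g)   # labels extracted but not annotated
        fn += len(g - p)   # labels annotated but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not from the paper).
gold = [{"pill"}, {"iud"}, {"implant", "pill"}]
pred = [{"pill"}, {"patch"}, {"implant"}]
score = micro_f1(gold, pred)
print(round(score, 3))
```

Pooling counts before averaging means frequent contraceptive types dominate the score, which is worth keeping in mind when comparing started and stopped performance.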
What To Try In 7 Days
Run the authors' prompt on a small sample of your deidentified notes and manually check 50 examples.
Compare GPT‑4 extractions to your structured medication records to find documentation gaps.
Cluster the extracted free‑text reasons (BERTopic or similar) to surface common operational issues like insurance gaps.
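The first two steps above can be sketched with the standard library alone. This is a rough illustration under stated assumptions: extractions are held as a mapping from note ID to the extracted started contraceptive, structured medication records as a mapping from note ID to a set of recorded methods, and both function names are hypothetical. Clustering with BERTopic is left out because it depends on an external library and model download.

```python
import random


def sample_for_review(note_ids, k=50, seed=0):
    """Draw a reproducible random sample of note IDs for manual checking."""
    ids = sorted(note_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def documentation_gaps(extracted, structured):
    """Return {note_id: method} where the extracted method does not
    appear in the structured record for that note."""
    gaps = {}
    for note_id, method in extracted.items():
        if method is None:
            continue  # nothing was extracted for this note
        recorded = {m.lower() for m in structured.get(note_id, set())}
        if method.lower() not in recorded:
            gaps[note_id] = method
    return gaps


# Toy data (hypothetical).
extracted = {"n1": "IUD", "n2": "pill", "n3": None}
structured = {"n1": {"iud"}, "n2": {"patch"}}

to_review = sample_for_review(extracted, k=2)
gaps = documentation_gaps(extracted, structured)
print(gaps)
```

A gap here only means the note and the structured record disagree; which one is right still needs a human check.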
Risks & Boundaries
Limitations
Single academic center dataset; results may not generalize to other hospitals or documentation styles.
Clinical notes were deidentified and some brand names were incorrectly redacted, which can harm extraction accuracy.
When Not To Use
When you require fully open‑source models for regulatory or reproducibility reasons.
When clinical notes are extremely short or heavily redacted for privacy such that reasons are not present.
Failure Modes
Hallucination: the model can invent reasons that are not present in the note (2.2% observed in manual review).
Missed mentions: stops may not be documented, lowering stop extraction reliability on large noisy sets.
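To track these failure modes on your own data, a manual review can be tallied into rates. The sketch below assumes each reviewed extraction receives exactly one of three labels; the label names and the toy data are hypothetical and are not the authors' annotation scheme.

```python
from collections import Counter

REVIEW_LABELS = ("correct", "incorrect", "hallucinated")


def review_rates(labels):
    """Convert a list of manual-review labels into per-label proportions."""
    if not labels:
        raise ValueError("no reviewed examples")
    unknown = set(labels) - set(REVIEW_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {name: counts[name] / total for name in REVIEW_LABELS}


# Toy review of ten extractions (hypothetical).
labels = ["correct"] * 8 + ["incorrect"] + ["hallucinated"]
rates = review_rates(labels)
print(rates)
```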

