GPT‑4 can extract why patients switch contraceptives from clinical notes and reveal group-level disparities

Overview

Decision Snapshot: Needs Validation

The approach is practical and outperforms baselines on this dataset, but it relies on proprietary GPT‑4, a single‑center cohort, and manual validation to catch hallucinations.

Citations: 5

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Brenda Y. Miao, Christopher YK Williams, Ebenezer Chinedu-Eneh, Travis Zack, Emily Alsentzer, Atul J. Butte, Irene Y. Chen

Links

Abstract / PDF / Code

Why It Matters For Business

Zero‑shot LLM extraction can unlock reasons for medication changes from notes quickly, lowering annotation costs and enabling equity and quality analyses across patient subgroups.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The authors used GPT‑4 (zero-shot, HIPAA‑compliant Azure API) on deidentified clinical notes from UCSF to extract which contraceptive was stopped, which was started, and the free-text reason for switching. They validated prompts on 93 manually annotated notes, then applied the best prompt to 1,964 switches from 1,515 patients. GPT‑4 achieved high correctness on extracted reasons (91.4% accuracy, 2.2% hallucination) and good extraction for 'started' contraceptives; it outperformed BOW/TF‑IDF, random forest, and a clinical BERT baseline. Topic clustering of extracted reasons found common causes (bleeding, patient preference, forgetting pills, adverse events, insurance) and linked insurance and

Problem Statement

Reasons for switching contraceptives are often written only in free-text clinical notes. Manually labeling notes or training custom models is slow. The paper asks: can a general LLM (GPT‑4) extract started/stopped contraceptives and the reason for switching without task-specific training, and can those extractions reveal subgroup differences?
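As a rough illustration of the zero-shot setup described above, the sketch below assembles an instruction around a note and parses a JSON reply into the three target fields. The prompt wording, the field names, and the sample note are hypothetical and are not the authors' prompt; the model reply is a hard-coded stand-in, so no API is called.

```python
import json

# Hypothetical field names for the three items the paper extracts.
FIELDS = ("contraceptive_stopped", "contraceptive_started", "reason_for_switch")


def build_prompt(note_text: str) -> str:
    """Assemble a zero-shot instruction plus the note (wording is illustrative)."""
    return (
        "Read the clinical note below and answer in JSON with the keys "
        + ", ".join(FIELDS)
        + '. Use "not documented" when the note does not say.\n\n'
        + "Note:\n"
        + note_text.strip()
    )


def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply; missing or malformed fields fall back
    to 'not documented' instead of raising."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: str(data.get(key, "not documented")) for key in FIELDS}


# Made-up note and a stand-in reply; replace fake_reply with a real model call.
note = "Pt stopped OCP due to breakthrough bleeding; IUD placed today."
prompt = build_prompt(note)
fake_reply = (
    '{"contraceptive_stopped": "OCP", "contraceptive_started": "IUD", '
    '"reason_for_switch": "breakthrough bleeding"}'
)
result = parse_response(fake_reply)
```

Defaulting to "not documented" keeps downstream counts honest when the model returns something unparseable, rather than silently dropping the note.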

Main Contribution

Demonstrate zero-shot GPT‑4 can extract contraceptive started/stopped and free-text reasons from clinical notes with high accuracy on a held-out, manually annotated set.

Provide a prompt-development and evaluation pipeline comparing six prompts and multiple baselines (logistic regression, random forest, UCSF‑BERT).

Key Findings

GPT‑4 correctly extracted reasons for switching on manual review.

Numbers: Accuracy 91.4%; hallucination rate 2.2% (n=93)

Practical Use: You can use GPT‑4 zero‑shot to pull reasons for medication changes from notes with low hallucination; validate on a small annotated set before scaling.

Evidence Ref: Abstract; Human evaluation (Results)
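To make the two review metrics concrete, here is a minimal sketch of how accuracy and hallucination rate could be tallied from per-note reviewer labels. The label names and the per-category counts (85, 2, 6) are illustrative assumptions chosen only because they sum to 93 and round to the reported percentages; they are not taken from the paper.

```python
from collections import Counter


def review_rates(labels):
    """Return accuracy and hallucination rate from per-note review labels.

    Each label is one of 'correct', 'incorrect', or 'hallucinated'
    (hypothetical category names for this sketch).
    """
    if not labels:
        raise ValueError("need at least one reviewed note")
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts["correct"] / n,
        "hallucination_rate": counts["hallucinated"] / n,
    }


# Illustrative counts only: 85 + 2 + 6 = 93 reviewed notes.
labels = ["correct"] * 85 + ["hallucinated"] * 2 + ["incorrect"] * 6
rates = review_rates(labels)
accuracy_pct = round(rates["accuracy"] * 100, 1)                 # 85/93
hallucination_pct = round(rates["hallucination_rate"] * 100, 1)  # 2/93
```

With only 93 reviewed notes, each note moves accuracy by about one percentage point, so small differences between prompts should be read cautiously.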

GPT‑4 extracted started and stopped contraceptives with high micro‑F1 on the development set, but micro‑F1 for stopped contraceptives fell sharply on the full test set.

Numbers: Dev microF1 started 0.849 / stopped 0.881; Test microF1 started 0.828 / stopped 0.439

Practical Use: Expect strong extraction for medication starts; confirm stop detection quality in your data, since stop extraction may drop on noisier, larger sets.

Evidence Ref: Prompt evaluation and full test comparison (Results; Figure 2–3)
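Micro-F1 pools true positives, false positives, and false negatives across all notes before computing precision and recall. The sketch below assumes each note carries a set of contraceptive labels (gold versus predicted); the set-per-note representation and the toy labels are assumptions for illustration, not the authors' evaluation code.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-note label sets.

    gold, pred: equal-length sequences of iterables of labels, one per note.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of notes")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: tp=2, fp=1, fn=2, so micro-F1 = 2*2 / (2*2 + 1 + 2) = 4/7.
gold = [{"IUD"}, {"pill"}, {"implant", "pill"}]
pred = [{"IUD"}, {"patch"}, {"implant"}]
score = micro_f1(gold, pred)
```

Running the function separately for started and stopped labels gives the two figures reported per split.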

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Contraceptive switches (cohort)	1,964 switches from 1,515 patients	—	—	UCSF cohort (2012–2023)	Results: cohort selection and filtering; Figure 1	Results
GPT‑4 microF1 (development set)	Started 0.849; Stopped 0.881	—	—	Held-out 5% manually annotated notes (n=93)	Prompt evaluation (Figure 2; Results)	Results

What To Try In 7 Days

Run the authors' prompt on a small sample of your deidentified notes and manually check 50 examples.

Compare GPT‑4 extractions to your structured medication records to find documentation gaps.

Cluster the extracted free‑text reasons (BERTopic or similar) to surface common operational issues like insurance gaps.
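The first two steps above can be sketched with the standard library alone. This assumes extractions and structured medication records are both available as per-patient sets of method names; the patient IDs, method names, and function names are hypothetical.

```python
import random


def sample_for_review(records, k=50, seed=0):
    """Draw a reproducible random sample of records for manual checking."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def documentation_gaps(extracted, structured):
    """Per patient, list methods found in notes but absent from structured data."""
    gaps = {}
    for patient_id, methods in extracted.items():
        missing = set(methods) - set(structured.get(patient_id, ()))
        if missing:
            gaps[patient_id] = sorted(missing)
    return gaps


# Made-up data: note-derived methods vs. structured medication records.
extracted = {"p1": {"IUD"}, "p2": {"pill", "condoms"}, "p3": {"implant"}}
structured = {"p1": {"IUD"}, "p2": {"pill"}}

gaps = documentation_gaps(extracted, structured)
review = sample_for_review(sorted(extracted), k=2)
```

Fixing the seed keeps the review sample stable across reruns, which matters when several annotators check the same 50 notes.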

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/BMiao10/contraceptive-switching

Risks & Boundaries

Limitations

Single academic center dataset; results may not generalize to other hospitals or documentation styles.

Clinical notes were deidentified and some brand names were incorrectly redacted, which can harm extraction accuracy.

When Not To Use

When you require fully open‑source models for regulatory or reproducibility reasons.

When clinical notes are extremely short or heavily redacted for privacy such that reasons are not present.

Failure Modes

Hallucination: model can invent reasons not present (observed 2.2% in manual review).

Missed mentions: stops may not be documented, lowering stop extraction reliability on large noisy sets.
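One cheap guard against the hallucination failure mode is to check that an extracted contraceptive name actually appears in the source note. The sketch below is a simple string-matching heuristic, not the authors' validation procedure; the synonym map is a hypothetical input, and free-text reasons would still need manual review because they are often paraphrased.

```python
import re


def is_grounded(term, note_text, synonyms=None):
    """Return True if the term, or a listed synonym, occurs as a whole word
    in the note (case-insensitive)."""
    candidates = [term] + list((synonyms or {}).get(term.lower(), []))
    text = note_text.lower()
    return any(
        re.search(r"\b" + re.escape(candidate.lower()) + r"\b", text)
        for candidate in candidates
    )


# Made-up note for illustration.
note = "Patient stopped OCPs because of nausea; Nexplanon placed."

found_brand = is_grounded("Nexplanon", note)
found_missing = is_grounded("IUD", note)
found_synonym = is_grounded("pill", note, {"pill": ["OCP", "OCPs"]})
```

A failed check is only a flag for review: a correctly extracted brand name can fail it if deidentification redacted that name in the note.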

Core Entities

Models

GPT-4 (Microsoft Azure HIPAA instance), UCSF-BERT, Random Forest, Logistic Regression

Metrics

MicroF1, Accuracy, Hallucination rate, Cohen's Kappa

Datasets

UCSF Information Commons deidentified clinical notes

Context Entities

Models

GPT-4 (proprietary, training details not public)

Metrics

MicroF1 for medication start/stop, Topic enrichment scores

Datasets

Deidentified clinician notes; not publicly available
