GPT‑4 can extract why patients switch contraceptives from clinical notes and reveal group-level disparities

February 6, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

5

Authors

Brenda Y. Miao, Christopher YK Williams, Ebenezer Chinedu-Eneh, Travis Zack, Emily Alsentzer, Atul J. Butte, Irene Y. Chen

Why It Matters For Business

Zero‑shot LLM extraction can unlock reasons for medication changes from notes quickly, lowering annotation costs and enabling equity and quality analyses across patient subgroups.

Summary TLDR

The authors ran GPT‑4 zero-shot, through a HIPAA‑compliant Azure API, on deidentified clinical notes from UCSF to extract which contraceptive was stopped, which was started, and the free-text reason for switching. They validated prompts on 93 manually annotated notes, then applied the best prompt to 1,964 switches from 1,515 patients. GPT‑4 extracted reasons with high correctness (91.4% accuracy, 2.2% hallucination) and performed well on 'started' contraceptives; it outperformed logistic regression and random forest baselines built on BOW/TF‑IDF features, as well as a clinical BERT baseline (UCSF‑BERT). Topic clustering of the extracted reasons found common causes (bleeding, patient preference, forgetting pills, adverse events, insurance) and linked insurance and

Problem Statement

Reasons for switching contraceptives are often written only in free-text clinical notes. Manually labeling notes or training custom models is slow. The paper asks: can a general LLM (GPT‑4) extract started/stopped contraceptives and the reason for switching without task-specific training, and can those extractions reveal subgroup differences?
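
A zero-shot setup of this kind can be sketched as a prompt builder plus a defensive parser for the model's reply. This is a minimal illustration, not the authors' prompt: the prompt wording, the JSON keys (`stopped`, `started`, `reason`), and the function names are all assumptions, and the call to the model itself is deliberately left out.

```python
import json

# Hypothetical prompt wording; the authors' actual prompts are not reproduced here.
PROMPT_TEMPLATE = (
    "You are reviewing a deidentified clinical note.\n"
    "Return a JSON object with exactly these keys:\n"
    '  "stopped": list of contraceptives the patient stopped,\n'
    '  "started": list of contraceptives the patient started,\n'
    '  "reason": short free-text reason for the switch, or null if not stated.\n'
    "Use only information written in the note.\n\n"
    "Note:\n{note}"
)


def build_prompt(note_text: str) -> str:
    """Fill the template with one note; no examples are included (zero-shot)."""
    return PROMPT_TEMPLATE.format(note=note_text.strip())


def _as_list(value) -> list:
    """Normalize a string or list field to a list of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [str(v) for v in value]
    return []


def parse_response(raw: str) -> dict:
    """Parse the model's reply, falling back to empty fields on malformed output."""
    empty = {"stopped": [], "started": [], "reason": None}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    reason = data.get("reason")
    return {
        "stopped": _as_list(data.get("stopped")),
        "started": _as_list(data.get("started")),
        "reason": reason if isinstance(reason, str) and reason.strip() else None,
    }
```

Returning empty fields instead of raising keeps a batch run going when a single reply is malformed; those notes can then be routed to manual review.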

Main Contribution

Demonstrate zero-shot GPT‑4 can extract contraceptive started/stopped and free-text reasons from clinical notes with high accuracy on a held-out, manually annotated set.

Provide a prompt-development and evaluation pipeline comparing six prompts and multiple baselines (logistic regression, random forest, UCSF‑BERT).

Cluster GPT‑4 extracted reasons with BERTopic to identify common reasons and show demographic enrichment (e.g., insurance, weight/mood concerns) in specific race/ethnicity groups.

Key Findings

On manual review, GPT‑4 extracted the reason for switching correctly in most notes.

Numbers: Accuracy 91.4%; hallucination rate 2.2% (n=93)

GPT‑4 extracted started and stopped contraceptives with high micro‑F1 on the development set, but performance on stopped contraceptives dropped sharply on the test set.

Numbers: Dev micro‑F1 started 0.849 / stopped 0.881; test micro‑F1 started 0.828 / stopped 0.439

Zero‑shot GPT‑4 outperformed traditional baselines trained on structured silver labels.

Numbers: Best baseline (random forest on TF‑IDF) micro‑F1 for started 0.714 vs GPT‑4 0.828

Scope and size of the cohort used for analysis.

Numbers: 1,964 contraceptive switches from 1,515 patients (from 20,283 patients sampled)

Certain reasons for switching are more common in specific race/ethnicity groups.

Numbers: Black/AA and Latinx groups showed enrichment for 'insurance coverage'; Latinx and 'Other' showed enrichment for 'weight‑

Results

Contraceptive switches (cohort)

Value: 1,964 switches from 1,515 patients

GPT‑4 microF1 (development set)

Value: Started 0.849; Stopped 0.881

GPT‑4 microF1 (test set)

Value: Started 0.828; Stopped 0.439

Reason extraction correctness

Value: 91.4% correct

Hallucination rate

Value: 2.2%

Baseline classifier best microF1 (start)

Value: Random forest (TF‑IDF) 0.714

Demographic differences (example)

Value: Black/AA 19.3% with a switch vs 8.2% without; mean age 25.9 with a switch vs 29.1 without
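
The micro‑F1 figures above pool true positives, false positives, and false negatives across all notes before computing F1. A minimal sketch follows, assuming each note's gold and predicted contraceptives are represented as sets of normalized names; the label names in the example are made up, and the paper's exact label set and matching rules are not reproduced here.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-note label sets.

    Counts are pooled across notes: F1 = 2*TP / (2*TP + FP + FN).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have one entry per note")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in gold
        fp += len(p - g)  # labels predicted but absent from gold
        fn += len(g - p)  # gold labels the prediction missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with hypothetical labels (not data from the paper).
gold_started = [{"pill"}, {"iud"}, {"implant", "pill"}]
pred_started = [{"pill"}, {"patch"}, {"implant"}]
score = micro_f1(gold_started, pred_started)  # TP=2, FP=1, FN=2
```

Running the same function separately on the "started" and "stopped" label sets gives the two numbers reported per split.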

Who Should Care

What To Try In 7 Days

Run the authors' prompt on a small sample of your deidentified notes and manually check 50 examples.

Compare GPT‑4 extractions to your structured medication records to find documentation gaps.

Cluster the extracted free‑text reasons (BERTopic or similar) to surface common operational issues like insurance gaps.
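
The first two steps can be sketched as a tally of manual review labels and a per-note comparison between extracted contraceptives and structured records. This is an illustration under stated assumptions: the three review labels (`correct`, `incorrect`, `hallucinated`), the note-ID keyed dictionaries, and the function names are hypothetical, not the authors' rubric or schema.

```python
from collections import Counter


def review_rates(labels: list[str]) -> dict:
    """Summarize manual review labels.

    Assumes each reviewed extraction is labeled exactly one of
    'correct', 'incorrect', or 'hallucinated' (a hypothetical rubric).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reviewed examples")
    return {
        "accuracy": counts["correct"] / total,
        "hallucination_rate": counts["hallucinated"] / total,
    }


def documentation_gaps(extracted: dict, structured: dict) -> dict:
    """Compare contraceptives found in notes with structured records.

    Both arguments map a note ID to a set of contraceptive names.
    Only notes where the two sources disagree are returned.
    """
    gaps = {}
    for note_id in extracted.keys() | structured.keys():
        in_note = extracted.get(note_id, set())
        in_record = structured.get(note_id, set())
        only_in_note = in_note - in_record
        only_in_record = in_record - in_note
        if only_in_note or only_in_record:
            gaps[note_id] = {
                "only_in_note": only_in_note,
                "only_in_record": only_in_record,
            }
    return gaps
```

Names need to be normalized to a shared vocabulary before the comparison, otherwise brand and generic names for the same method show up as false gaps.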

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single academic center dataset; results may not generalize to other hospitals or documentation styles.
  • Clinical notes were deidentified and some brand names were incorrectly redacted, which can harm extraction accuracy.
  • GPT‑4 is proprietary; training data and model internals are not public, limiting interpretability.
  • Structured medication labels (silver labels) were noisy and disagreed with manual annotations of the notes, complicating baseline training.

When Not To Use

  • When you require fully open‑source models for regulatory or reproducibility reasons.
  • When clinical notes are extremely short or heavily redacted for privacy such that reasons are not present.
  • When you need provable causal inference rather than descriptive extraction of reasons.

Failure Modes

  • Hallucination: model can invent reasons not present (observed 2.2% in manual review).
  • Missed mentions: stops may not be documented, lowering stop extraction reliability on large noisy sets.
  • Redaction errors: deidentification may remove brand names or key terms and reduce accuracy.
  • Bias in documentation: clinical note language varies by clinician and institution, affecting topic clusters.
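
One cheap guard against the hallucination failure mode is to flag extracted reasons whose content words do not appear in the source note. The sketch below is a simple lexical heuristic offered as an assumption, not the authors' method: the stopword list and the 0.5 threshold are arbitrary, and a correct paraphrase will also be flagged, so it is suited to prioritizing manual review rather than rejecting output automatically.

```python
import re

# Small illustrative stopword list; an assumption, not a vetted resource.
STOPWORDS = {
    "a", "an", "the", "of", "to", "and", "or", "due", "for", "with",
    "in", "on", "because", "patient", "her", "she",
}


def content_words(text: str) -> set:
    """Lowercased alphabetic tokens longer than two letters, minus stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


def grounding_score(reason: str, note: str) -> float:
    """Fraction of the reason's content words that also occur in the note."""
    words = content_words(reason)
    if not words:
        return 0.0
    return len(words & content_words(note)) / len(words)


def flag_for_review(reason: str, note: str, threshold: float = 0.5) -> bool:
    """True when too few of the reason's words are supported by the note."""
    return grounding_score(reason, note) < threshold
```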

Core Entities

Models

  • GPT-4 (Microsoft Azure HIPAA instance)
  • UCSF-BERT
  • Random Forest
  • Logistic Regression

Metrics

  • MicroF1
  • Accuracy
  • Hallucination rate
  • Cohen's Kappa

Datasets

  • UCSF Information Commons deidentified clinical notes

Context Entities

Models

  • GPT-4 (proprietary, training details not public)

Metrics

  • MicroF1 for medication start/stop
  • Topic enrichment scores

Datasets

  • Deidentified clinician notes; not publicly available