Which prompt styles work best for zero-shot clinical NLP across GPT‑3.5, BARD, and LLAMA2

September 14, 2023 | 6 min

Overview

Decision Snapshot: Needs Validation

The experiments are practical and actionable, but sample sizes and potential training-data leakage reduce confidence for direct clinical deployment without further validation.

Citations: 10

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang

Why It Matters For Business

Prompt choice can cut labeling costs: a well-crafted zero-shot prompt often comes close to supervised accuracy, reducing the need for costly annotations.

Summary TLDR

The paper tests seven prompt styles (prefix, cloze, anticipatory, chain-of-thought, heuristic, ensemble, few-shot variants) across five clinical NLP tasks and three LLMs (GPT‑3.5, BARD/PALM‑2, LLAMA2). Heuristic and chain-of-thought prompts frequently gave the best zero-shot accuracy; GPT‑3.5 was generally strongest. Few-shot (2 examples) often helps, but well-designed zero-shot heuristics beat few-shot in some tasks. Results are reported using accuracy on CASI and EBM‑NLP samples (Table 3).

Problem Statement

Clinical NLP lacks labeled data. The paper asks: which prompt designs reliably guide modern LLMs to do five common clinical extraction tasks without task-specific training?

Main Contribution

Systematic comparison of seven prompt types for five clinical NLP tasks in zero/few-shot settings.

Introduction of heuristic prompts (rule-driven) and ensemble prompts (majority vote over prompt types).
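
The ensemble idea (majority vote over prompt types) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the tie-breaking rule (first label seen wins), and the example labels are all assumptions made for the sketch.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among per-prompt predictions.

    Assumption: ties are broken in favor of the label that appears
    first in the input list; the source does not specify a rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical outputs from five prompt styles for one abbreviation instance.
votes = {
    "prefix": "computed tomography",
    "cloze": "computed tomography",
    "anticipatory": "cardiothoracic",
    "chain_of_thought": "computed tomography",
    "heuristic": "computed tomography",
}
ensemble_label = majority_vote(list(votes.values()))
```

In practice the per-prompt outputs would likely need normalizing (case, whitespace, synonyms) before voting, since free-text answers rarely match exactly.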

Key Findings

Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.

Numbers: GPT‑3.5 heuristic accuracy = 0.96

Practical Use: For classification-style clinical tasks, try concise rule-based (heuristic) prompts first.

Evidence Ref: Table 3
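
To make "concise rule-based prompt" concrete, here is one possible shape for a heuristic prompt on an abbreviation-sense task. The wording of the rule, the function `build_heuristic_prompt`, and the example sentence are hypothetical; the paper's actual prompt text is not reproduced in this summary.

```python
def build_heuristic_prompt(abbreviation, sentence, candidate_senses):
    """Assemble a rule-driven zero-shot prompt (illustrative wording only)."""
    options = "\n".join(f"- {sense}" for sense in candidate_senses)
    return (
        f"Rule: in a clinical note, decide what '{abbreviation}' means by "
        "reading the surrounding words, then answer with exactly one option.\n"
        f"Options:\n{options}\n"
        f"Sentence: {sentence}\n"
        "Answer:"
    )


# Hypothetical example instance, not taken from CASI.
prompt = build_heuristic_prompt(
    "CT",
    "Patient underwent CT of the chest to rule out pulmonary embolism.",
    ["computed tomography", "cardiothoracic"],
)
print(prompt)
```

Restricting the answer to a fixed option list keeps the output easy to score against gold labels.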

Chain-of-thought and heuristic prompts performed best for biomedical evidence extraction.

Numbers: GPT‑3.5 CoT/heuristic accuracy = 0.94; few-shot = 0.96

Practical Use: Use CoT or heuristic prompts for extraction; add 1–2 examples (few-shot) to gain ~2 points when possible.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.96 (heuristic, GPT-3.5) | BARD prefix 0.76 | ∆=+0.20 | CASI subset | Table 3 shows 0.96 for GPT‑3.5 heuristic | Table 3
Accuracy | 0.94 (heuristic/CoT, GPT-3.5) / 0.96 (few-shot, GPT-3.5) | zero-shot 0.94 | few-shot +0.02 vs zero-shot | EBM-NLP | Table 3 reports 0.94 zero-shot and 0.96 few-shot for GPT‑3.5 | Table 3

What To Try In 7 Days

Run a quick pilot: test heuristic prompts on one classification task with GPT‑3.5

Try chain-of-thought prompts for one coreference or relation task

Add 1–2 representative examples (few-shot) and compare accuracy delta vs zero-shot
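
For the pilot steps above, the comparison comes down to computing accuracy per prompt setting on the same gold labels and taking the difference. The sketch below uses made-up toy labels purely to show the arithmetic; the names `accuracy`, `zero_shot_preds`, and `few_shot_preds` are assumptions, and the model-calling step is deliberately left out.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold labels must not be empty")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy data for illustration only; replace with your own labeled sample.
gold = ["A", "B", "A", "C", "B"]
zero_shot_preds = ["A", "B", "C", "C", "A"]  # 3 of 5 correct
few_shot_preds = ["A", "B", "A", "C", "A"]   # 4 of 5 correct

zero_shot_acc = accuracy(zero_shot_preds, gold)
few_shot_acc = accuracy(few_shot_preds, gold)
delta = few_shot_acc - zero_shot_acc

print(f"zero-shot={zero_shot_acc:.2f} few-shot={few_shot_acc:.2f} delta={delta:+.2f}")
```

With a sample this small, a single flipped prediction moves accuracy by 20 points, so a real pilot should use enough examples for a 2-point delta to be meaningful.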

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Possible training-data leakage: LLMs' pretraining sources are unknown, so high accuracy may reflect prior exposure to evaluation data.

Evaluation uses small sampled subsets (CASI, EBM‑NLP samples), limiting generality.

When Not To Use

For high-stakes clinical decisions without human review

When labeled data and finetuning are available and affordable

Failure Modes

Inconsistent outputs due to LLM randomness

Hallucinated or incorrect extractions in noisy notes

Core Entities

Models

GPT-3.5, BARD (PALM-2), LLAMA2

Metrics

Accuracy

Datasets

CASI, EBM-NLP