Which prompt styles work best for zero-shot clinical NLP across GPT‑3.5, BARD, and LLAMA2

Overview

Decision Snapshot: Needs Validation

The experiments are practical and actionable, but small sample sizes and possible training-data leakage reduce confidence in direct clinical deployment without further validation.

Citations: 10

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang

Links

Abstract / PDF

Why It Matters For Business

Prompt choice can cut labeling costs: a well-crafted zero-shot prompt often approaches supervised accuracy, reducing the need for costly annotations.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The paper tests seven prompt styles (prefix, cloze, anticipatory, chain-of-thought, heuristic, ensemble, few-shot variants) across five clinical NLP tasks and three LLMs (GPT‑3.5, BARD/PALM‑2, LLAMA2). Heuristic and chain-of-thought prompts frequently gave the best zero-shot accuracy; GPT‑3.5 was generally strongest. Few-shot (2 examples) often helps, but well-designed zero-shot heuristics beat few-shot in some tasks. Results are reported using accuracy on CASI and EBM‑NLP samples (Table 3).

Problem Statement

Clinical NLP lacks labeled data. The paper asks: which prompt designs reliably guide modern LLMs to do five common clinical extraction tasks without task-specific training?

Main Contribution

Systematic comparison of seven prompt types for five clinical NLP tasks in zero/few-shot settings.

Introduction of heuristic prompts (rule-driven) and ensemble prompts (majority vote over prompt types).
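The ensemble idea (majority vote over prompt types) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `ensemble_vote`, the label normalization, the tie-breaking rule, and the example labels are all assumptions made for the sketch.

```python
from collections import Counter


def ensemble_vote(predictions: dict[str, str]) -> str:
    """Return the most common label across prompt types.

    `predictions` maps a prompt-type name to the label that prompt produced.
    Labels are lowercased and stripped before counting (an assumption).
    Ties go to the label seen first, which is an arbitrary choice.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    normalized = [label.strip().lower() for label in predictions.values()]
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical outputs for one abbreviation, one per prompt type.
votes = {
    "prefix": "magnetic resonance",
    "cloze": "mitral regurgitation",
    "chain_of_thought": "mitral regurgitation",
    "heuristic": "Mitral Regurgitation",
}
print(ensemble_vote(votes))
```

Normalizing labels before counting matters in practice, since free-text model outputs rarely match character for character.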

Key Findings

Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.

Numbers: GPT‑3.5 heuristic accuracy = 0.96

Practical Use: For classification-style clinical tasks, try concise rule-based (heuristic) prompts first.

Evidence Ref: Table 3

Chain-of-thought and heuristic prompts performed best for biomedical evidence extraction.

Numbers: GPT‑3.5 CoT/heuristic accuracy = 0.94; few-shot = 0.96

Practical Use: Use CoT or heuristic prompts for extraction; add 1–2 examples (few-shot) to gain ~2 points when possible.

Evidence Ref: Table 3
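To make the prompt styles named in these findings concrete, here is a rough sketch of how a few of them might be templated for an abbreviation-disambiguation task. The template wordings, the `TEMPLATES` dictionary, and `build_prompt` are illustrative assumptions, not the prompts used in the paper.

```python
# Hypothetical templates; the paper's actual prompt wording is not reproduced here.
TEMPLATES = {
    "prefix": (
        "Identify the meaning of the abbreviation '{abbr}' "
        "in this clinical note.\n\nNote: {note}"
    ),
    "cloze": (
        "Note: {note}\n\n"
        "In this note, the abbreviation '{abbr}' stands for ____."
    ),
    "chain_of_thought": (
        "Note: {note}\n\n"
        "What does '{abbr}' mean here? Think step by step, "
        "then give the final answer on its own line."
    ),
    "heuristic": (
        "Note: {note}\n\n"
        "Apply these rules to decide what '{abbr}' means:\n{rules}\n"
        "Answer with the expanded term only."
    ),
}


def build_prompt(style: str, note: str, abbr: str, rules: str = "") -> str:
    """Fill the template for `style`; `rules` is only used by 'heuristic'."""
    if style not in TEMPLATES:
        raise ValueError(f"unknown prompt style: {style}")
    # str.format ignores keyword arguments a template does not reference.
    return TEMPLATES[style].format(note=note, abbr=abbr, rules=rules)


example_rules = "- If the note discusses heart valves, prefer a cardiac sense."
print(build_prompt("heuristic", "Echo shows moderate MR.", "MR", example_rules))
```

Keeping every style behind one function makes it straightforward to run the same notes through each template and compare accuracy per style.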

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.96 (heuristic, GPT-3.5)	0.76 (prefix, BARD)	+0.20	CASI subset	Table 3 shows 0.96 for GPT‑3.5 heuristic	Table 3
Accuracy	0.96 (few-shot, GPT-3.5)	0.94 (zero-shot heuristic/CoT, GPT-3.5)	+0.02	EBM-NLP	Table 3 reports 0.94 zero-shot and 0.96 few-shot for GPT‑3.5	Table 3

What To Try In 7 Days

Run a quick pilot: test heuristic prompts on one classification task with GPT‑3.5

Try chain-of-thought prompts for one coreference or relation task

Add 1–2 representative examples (few-shot) and compare accuracy delta vs zero-shot
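The last step of the pilot, comparing few-shot against zero-shot, comes down to a simple accuracy delta. A minimal sketch follows, assuming exact-match scoring; the `accuracy` helper and the toy labels are made up for illustration.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled example")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy data: five labeled items and two sets of model outputs (hypothetical).
gold = ["A", "B", "A", "C", "B"]
zero_shot = ["A", "B", "C", "C", "A"]  # 3 of 5 correct
few_shot = ["A", "B", "A", "C", "A"]   # 4 of 5 correct

delta = accuracy(few_shot, gold) - accuracy(zero_shot, gold)
print(f"few-shot vs zero-shot delta: {delta:+.2f}")
```

On a pilot of this size a single item moves accuracy by 20 points, so the delta is only meaningful once the labeled set is much larger.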

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Possible training-data leakage: LLMs' pretraining sources are unknown, so high accuracy may reflect prior exposure to evaluation data.

Evaluation uses small sampled subsets (CASI, EBM‑NLP samples), limiting generality.
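One way to see how much a small sample limits a reported accuracy is to put a confidence interval around it. The sketch below uses the Wilson score interval, a standard choice for proportions; the sample sizes are hypothetical, since this summary does not state how many items were evaluated.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    z2 = z * z
    denom = 1 + z2 / n
    center = (accuracy + z2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half


# Same reported accuracy, two assumed sample sizes (not taken from the paper).
for n in (100, 1000):
    low, high = wilson_interval(0.96, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval widens as the sample shrinks, so differences of a few points between prompts may not be distinguishable on small subsets.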

When Not To Use

For high-stakes clinical decisions without human review

When labeled data and finetuning are available and affordable

Failure Modes

Inconsistent outputs due to LLM randomness

Hallucinated or incorrect extractions in noisy notes
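Output inconsistency can be measured by sending the same prompt several times and checking how often the answers agree. A minimal sketch, assuming the repeated outputs have already been collected as strings; `agreement_rate`, `needs_review`, and the 0.8 threshold are illustrative choices, not something the paper prescribes.

```python
from collections import Counter


def agreement_rate(outputs: list[str]) -> float:
    """Share of runs that match the most common (normalized) answer."""
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [o.strip().lower() for o in outputs]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def needs_review(outputs: list[str], threshold: float = 0.8) -> bool:
    """Flag an item for human review when agreement falls below the threshold."""
    return agreement_rate(outputs) < threshold


# Five hypothetical runs of the same prompt on the same note.
runs = ["placebo", "Placebo", "placebo ", "aspirin", "placebo"]
print(f"agreement = {agreement_rate(runs):.2f}, review = {needs_review(runs)}")
```

Items with low agreement are natural candidates for the human review that high-stakes use requires.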

Core Entities

Models

GPT-3.5, BARD (PALM-2), LLAMA2

Metrics

Accuracy

Datasets

CASI, EBM-NLP

You May Also Want to Read

Use LLMs to synthesize context examples and cut expert annotation by ~40–60% for biomedical entity linking

LLM judges are prompt‑sensitive and internally noisy; here's an explainable toolkit to measure and de-noise them

SCORE: report accuracy ranges and consistency, not just one score

Open-source, reproducible benchmark that compares 10+ LLMs on 20+ tasks and traces the path from GPT-3 to GPT-4

KemenkeuGPT: a LangChain+RAG LLM for Indonesian finance that raised accuracy from 35% to 61%
