Which prompt styles work best for zero-shot clinical NLP across GPT‑3.5, BARD, and LLAMA2

September 14, 2023 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

10

Authors

Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang

Links

Abstract / PDF

Why It Matters For Business

Prompt choice affects labeling costs: a well-crafted zero-shot prompt often gets near supervised accuracy, reducing the need for costly annotations.

Summary TLDR

The paper tests seven prompt styles (prefix, cloze, anticipatory, chain-of-thought, heuristic, ensemble, few-shot variants) across five clinical NLP tasks and three LLMs (GPT‑3.5, BARD/PALM‑2, LLAMA2). Heuristic and chain-of-thought prompts frequently gave the best zero-shot accuracy, and GPT‑3.5 was generally the strongest model. Few-shot prompting (2 examples) often helps, but well-designed zero-shot heuristic prompts beat few-shot on some tasks. Results are reported as accuracy on CASI and EBM‑NLP samples (Table 3).
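To make the single-prompt styles concrete, here is a minimal Python sketch of how five of them might be phrased for an abbreviation-disambiguation query. The template wording, the `TEMPLATES` dictionary, and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical templates: the wording is an assumption for illustration,
# not the prompts used in the paper.
TEMPLATES = {
    # Prefix: a direct instruction placed before the input.
    "prefix": (
        "What does the abbreviation '{abbr}' mean in this sentence?\n"
        "Sentence: {sentence}"
    ),
    # Cloze: the model fills in a blank.
    "cloze": (
        "Sentence: {sentence}\n"
        "In this sentence, '{abbr}' stands for ____."
    ),
    # Anticipatory: the prompt also asks for the follow-up a reader would want.
    "anticipatory": (
        "Sentence: {sentence}\n"
        "Give the meaning of '{abbr}' here and name the clue that supports it."
    ),
    # Chain-of-thought: the model is asked to reason before answering.
    "chain_of_thought": (
        "Sentence: {sentence}\n"
        "Think step by step about the clinical context, "
        "then state the meaning of '{abbr}'."
    ),
    # Heuristic: an explicit rule is supplied with the task.
    "heuristic": (
        "Rule: pick the expansion that fits the surrounding clinical context; "
        "if the sentence mentions lab values, prefer a laboratory sense.\n"
        "Sentence: {sentence}\n"
        "Meaning of '{abbr}':"
    ),
}


def build_prompt(style, abbr, sentence):
    """Fill the chosen template; raises KeyError for an unknown style."""
    return TEMPLATES[style].format(abbr=abbr, sentence=sentence)


if __name__ == "__main__":
    for style in TEMPLATES:
        print(f"--- {style} ---")
        print(build_prompt(style, "PT", "The patient was referred to PT."))
```

Keeping every style in one table makes it easy to run the same input through all of them and compare answers side by side.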

Problem Statement

Clinical NLP lacks labeled data. The paper asks: which prompt designs reliably guide modern LLMs to perform five common clinical extraction tasks without task-specific training?

Main Contribution

Systematic comparison of seven prompt types for five clinical NLP tasks in zero/few-shot settings.

Introduction of heuristic prompts (rule-driven) and ensemble prompts (majority vote over prompt types).

Practical guidelines mapping task types to prompt types and recommending models per prompt.
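The ensemble idea, a majority vote over the answers returned by different prompt types, can be sketched in a few lines of Python. The `ensemble_vote` name, the lowercase normalization, and the tie-breaking rule (first answer seen wins) are assumptions; the summary does not say how the paper handles ties.

```python
from collections import Counter


def ensemble_vote(answers):
    """Return the most frequent answer across prompt types.

    Answers are compared after stripping whitespace and lowercasing
    (an assumption). On a tie, the answer that appeared first wins.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    normalized = [a.strip().lower() for a in answers]
    # most_common orders equal counts by first appearance (Python 3.7+).
    return Counter(normalized).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical answers from three prompt styles for the same input.
    votes = ["physical therapy", "Physical Therapy ", "prothrombin time"]
    print(ensemble_vote(votes))
```

In practice the answers would need stronger normalization than lowercasing, since free-text LLM outputs rarely match character for character.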

Key Findings

Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.

Numbers: GPT‑3.5 heuristic accuracy = 0.96

Chain-of-thought and heuristic prompts performed best for biomedical evidence extraction.

Numbers: GPT‑3.5 CoT/heuristic accuracy = 0.94; few-shot = 0.96

Chain-of-thought excelled at coreference resolution.

Numbers: GPT‑3.5 CoT accuracy = 0.94

Few-shot prompting usually improves accuracy but not always.

Numbers: On many tasks few-shot > zero-shot; for example, EBM‑NLP few-shot 0.96 vs zero-shot 0.94

Model choice matters: GPT‑3.5 outperformed BARD and LLAMA2 on most tasks.

Numbers: Example: clinical sense disambiguation, GPT‑3.5 heuristic 0.96 vs BARD prefix 0.76 (∆=0.20)
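The accuracy figures and deltas quoted in the findings above are plain proportions and differences. A minimal Python sketch, with `accuracy` and `delta` as hypothetical helper names and exact string match as an assumed scoring rule:

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def delta(score_a, score_b):
    """Difference between two accuracy scores, rounded to two decimals."""
    return round(score_a - score_b, 2)


if __name__ == "__main__":
    # Scores taken from the findings above.
    print(delta(0.96, 0.76))  # GPT-3.5 vs BARD prefix, clinical sense disambiguation
    print(delta(0.96, 0.94))  # few-shot vs zero-shot, EBM-NLP
```

Rounding in `delta` avoids floating-point noise such as 0.19999999999999996 when subtracting two-decimal scores.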

Results

Accuracy

Value: 0.96 (heuristic, GPT-3.5)

Accuracy

Value: 0.94 (heuristic/CoT, GPT-3.5) / 0.96 (few-shot, GPT-3.5)

Accuracy

Value: 0.94 (chain-of-thought, GPT-3.5)

Accuracy

Value: 0.96 (heuristic/CoT or few-shot, GPT-3.5)

Ensemble (majority vote) performance

Value: Usually second-best; for example, clinical sense disambiguation 0.90 (GPT-3.5)

Who Should Care

What To Try In 7 Days

Run a quick pilot: test heuristic prompts on one classification task with GPT‑3.5

Try chain-of-thought prompts for one coreference or relation task

Add 1–2 representative examples (few-shot) and compare accuracy delta vs zero-shot
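For the few-shot step in the pilot above, one way to keep the comparison fair is to build the zero-shot and few-shot prompts from the same instruction and differ only in the examples. The sketch below assumes a simple `Input:` / `Answer:` layout and a hypothetical `build_few_shot_prompt` helper; the example sentences are made up for illustration and do not come from CASI or EBM‑NLP.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt from an instruction, worked examples, and a query.

    examples is a list of (input_text, answer) pairs; an empty list
    yields the zero-shot version of the same prompt.
    """
    parts = [instruction.strip()]
    for text, answer in examples:
        parts.append(f"Input: {text}\nAnswer: {answer}")
    # The final block is left open for the model to complete.
    parts.append(f"Input: {query}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    instruction = "Expand the abbreviation PT as used in the sentence."
    # Made-up examples, not drawn from the evaluation datasets.
    examples = [
        ("The patient was referred to PT after knee surgery.", "physical therapy"),
        ("PT and INR were elevated on admission.", "prothrombin time"),
    ]
    query = "She attends PT twice a week for her shoulder."
    print(build_few_shot_prompt(instruction, [], query))        # zero-shot
    print("=" * 40)
    print(build_few_shot_prompt(instruction, examples, query))  # 2-shot
```

Running both prompts over the same labeled sample and scoring them the same way gives the accuracy delta directly.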

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Possible training-data leakage: LLMs' pretraining sources are unknown, so high accuracy may reflect prior exposure to evaluation data.
  • Evaluation uses small sampled subsets (CASI, EBM‑NLP samples), limiting generality.
  • Prompt space explored is large but not exhaustive; iterative manual tuning remains time-consuming.

When Not To Use

  • For high-stakes clinical decisions without human review
  • When labeled data and finetuning are available and affordable
  • When regulatory traceability requires deterministic, auditable systems

Failure Modes

  • Inconsistent outputs due to LLM randomness
  • Hallucinated or incorrect extractions in noisy notes
  • Ensemble majority vote introducing ambiguity for precise tasks
  • Prompt brittleness when context shifts or notation changes
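The first failure mode above, inconsistent outputs, can be checked before deployment by sending the same prompt several times and measuring how often the answers agree. The sketch below covers only the scoring step: `agreement_rate` is a hypothetical helper, the repeated model calls are assumed to happen elsewhere, and the lowercase normalization is an assumption.

```python
from collections import Counter


def agreement_rate(outputs):
    """Share of repeated outputs that match the most common output.

    1.0 means every run gave the same answer; lower values indicate
    run-to-run variation for the same prompt and input.
    """
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [o.strip().lower() for o in outputs]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


if __name__ == "__main__":
    # Hypothetical answers from four runs of the same prompt.
    runs = ["physical therapy", "Physical therapy", "prothrombin time", "physical therapy"]
    print(agreement_rate(runs))
```

A low agreement rate on a prompt is a signal to lower the sampling temperature where the API allows it, or to route that item to human review.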

Core Entities

Models

  • GPT-3.5
  • BARD (PALM-2)
  • LLAMA2

Metrics

  • Accuracy

Datasets

  • CASI
  • EBM-NLP