Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
10
Why It Matters For Business
Prompt choice can cut labeling costs: a well-crafted zero-shot prompt often comes close to supervised accuracy, reducing the need for costly annotations.
Summary TLDR
The paper tests seven prompt styles (prefix, cloze, anticipatory, chain-of-thought, heuristic, ensemble, few-shot variants) across five clinical NLP tasks and three LLMs (GPT‑3.5, BARD/PALM‑2, LLAMA2). Heuristic and chain-of-thought prompts frequently gave the best zero-shot accuracy, and GPT‑3.5 was generally the strongest model. Few-shot prompting (2 examples) often helps, but well-designed zero-shot heuristic prompts beat few-shot on some tasks. Results are reported as accuracy on samples from CASI and EBM‑NLP (Table 3).
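To make the difference between prompt styles concrete, here is a minimal sketch of how a few of them might be templated for an abbreviation-disambiguation input. The template wording, the `build_prompt` helper, and the example note are all hypothetical illustrations, not the prompts used in the paper.

```python
# Hypothetical templates; the paper's actual prompt wording is not reproduced here.
TEMPLATES = {
    # Prefix: the instruction comes first, followed by the input.
    "prefix": "Give the expanded form of the abbreviation '{term}' in this note: {sentence}",
    # Cloze: the model fills in a blank.
    "cloze": "{sentence}\nIn this note, '{term}' stands for ____.",
    # Chain-of-thought: the model is asked to reason before answering.
    "chain_of_thought": "{sentence}\nWhat does '{term}' mean here? Let's think step by step.",
}


def build_prompt(style, sentence, term, examples=()):
    """Build a zero-shot prompt, or a few-shot one when examples are given.

    `examples` is a sequence of (question, answer) pairs prepended as demonstrations.
    """
    body = TEMPLATES[style].format(sentence=sentence, term=term)
    shots = "".join(f"Example: {q}\nAnswer: {a}\n\n" for q, a in examples)
    return shots + body


note = "Patient started on 500 mg of metformin."  # made-up input
zero_shot = build_prompt("cloze", note, "mg")
few_shot = build_prompt(
    "prefix", note, "mg",
    examples=[("What does 'bid' mean in 'take bid'?", "twice a day")],
)
```

Keeping templates in one table makes it straightforward to run the same inputs through every style and compare accuracy per style.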
Problem Statement
Clinical NLP lacks labeled data. The paper asks: which prompt designs reliably guide modern LLMs to perform five common clinical extraction tasks without task-specific training?
Main Contribution
Systematic comparison of seven prompt types for five clinical NLP tasks in zero/few-shot settings.
Introduction of heuristic prompts (rule-driven) and ensemble prompts (majority vote over prompt types).
Practical guidelines mapping task types to prompt types and recommending models per prompt.
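The ensemble prompt described above takes a majority vote over the answers produced by different prompt types. A minimal sketch of that vote follows, assuming each prompt type has already produced one label for the same input. The tie rule (return `None`) and the example labels are assumptions made for illustration; the summary does not state how the paper resolves ties.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label, or None if the list is empty or the top labels tie."""
    if not predictions:
        return None
    ranked = Counter(predictions).most_common()
    # A tie between the two most frequent labels means there is no clear winner.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical outputs from four prompt types for one abbreviation instance.
votes = {
    "prefix": "magnesium",
    "cloze": "milligram",
    "chain_of_thought": "milligram",
    "heuristic": "milligram",
}
ensemble_label = majority_vote(list(votes.values()))
```

Returning `None` on a tie makes ambiguous cases visible, so they can be routed to human review rather than settled arbitrarily.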
Key Findings
Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.
Chain-of-thought and heuristic prompts performed best for biomedical evidence extraction.
Chain-of-thought excelled at coreference resolution.
Few-shot prompting usually, but not always, improves accuracy.
Model choice matters: GPT‑3.5 outperformed BARD and LLAMA2 on most tasks.
Results
Accuracy
Ensemble (majority vote) performance
Who Should Care
What To Try In 7 Days
Run a quick pilot: test heuristic prompts on one classification task with GPT‑3.5
Try chain-of-thought prompts for one coreference or relation task
Add 1–2 representative examples (few-shot) and compare accuracy delta vs zero-shot
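The zero-shot versus few-shot comparison in the last step comes down to computing accuracy twice on the same gold labels and taking the difference. A minimal sketch follows; the labels and predictions are made-up placeholders, not results from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Placeholder data for a four-item pilot set.
gold = ["A", "B", "A", "C"]
zero_shot_preds = ["A", "B", "C", "C"]  # 3 of 4 correct
few_shot_preds = ["A", "B", "A", "C"]   # 4 of 4 correct

# Positive delta means the few-shot prompt helped on this sample.
delta = accuracy(few_shot_preds, gold) - accuracy(zero_shot_preds, gold)
```

On a pilot set this small a single item moves accuracy by a large step, so a delta is only suggestive until it is repeated on a larger sample.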
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Possible training-data leakage: LLMs' pretraining sources are unknown, so high accuracy may reflect prior exposure to evaluation data.
- Evaluation uses small sampled subsets (CASI, EBM‑NLP samples), limiting generality.
- Prompt space explored is large but not exhaustive; iterative manual tuning remains time-consuming.
When Not To Use
- For high-stakes clinical decisions without human review
- When labeled data and finetuning are available and affordable
- When regulatory traceability requires deterministic, auditable systems
Failure Modes
- Inconsistent outputs due to LLM randomness
- Hallucinated or incorrect extractions in noisy notes
- Ensemble majority vote introducing ambiguity for precise tasks
- Prompt brittleness when context shifts or notation changes
Core Entities
Models
- GPT-3.5
- BARD (PALM-2)
- LLAMA2
Metrics
- Accuracy
Datasets
- CASI
- EBM-NLP

