Overview
The experiments are practical and actionable, but small sample sizes and potential training-data leakage reduce confidence in the results; direct clinical deployment would need further validation.
Citations: 10
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 40%
Why It Matters For Business
Prompt choice can cut labeling costs: a well-crafted zero-shot prompt often gets near supervised accuracy, reducing the need for costly annotations.
Summary TLDR
The paper tests seven prompt styles (prefix, cloze, anticipatory, chain-of-thought, heuristic, ensemble, few-shot variants) across five clinical NLP tasks and three LLMs (GPT‑3.5, BARD/PALM‑2, LLAMA2). Heuristic and chain-of-thought prompts frequently gave the best zero-shot accuracy; GPT‑3.5 was generally strongest. Few-shot (2 examples) often helps, but well-designed zero-shot heuristics beat few-shot in some tasks. Results are reported using accuracy on CASI and EBM‑NLP samples (Table 3).
Problem Statement
Clinical NLP lacks labeled data. The paper asks: which prompt designs reliably guide modern LLMs to do five common clinical extraction tasks without task-specific training?
Main Contribution
Systematic comparison of seven prompt types for five clinical NLP tasks in zero/few-shot settings.
Introduction of heuristic prompts (rule-driven) and ensemble prompts (majority vote over prompt types).
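The ensemble idea (majority vote over prompt types) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `majority_vote`, the tie-breaking rule (first label seen wins), and the example labels are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most prompt types.

    `predictions` holds one label per prompt type, in a fixed order.
    Ties go to the label that appears first (an assumed rule; the
    summary does not say how ties are resolved).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical outputs from five prompt types for one input.
outputs = ["sense_a", "sense_b", "sense_a", "sense_a", "sense_b"]
print(majority_vote(outputs))  # sense_a
```

With an odd number of prompt types and two candidate labels, ties cannot occur; with more labels, the tie rule starts to matter and should be chosen deliberately.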
Key Findings
Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.
Chain-of-thought and heuristic prompts performed best for biomedical evidence extraction.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.96 (heuristic, GPT-3.5) | 0.76 (prefix, BARD) | +0.20 | CASI subset | Table 3 shows 0.96 for GPT‑3.5 heuristic | Table 3 |
| Accuracy | 0.96 (few-shot, GPT-3.5) | 0.94 (zero-shot heuristic/CoT, GPT-3.5) | +0.02 | EBM-NLP | Table 3 reports 0.94 zero-shot and 0.96 few-shot for GPT‑3.5 | Table 3 |
What To Try In 7 Days
Run a quick pilot: test heuristic prompts on one classification task with GPT‑3.5
Try chain-of-thought prompts for one coreference or relation task
Add 1–2 representative examples (few-shot) and compare accuracy delta vs zero-shot
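The zero-shot versus few-shot comparison in the last step comes down to an accuracy difference on the same gold labels. A minimal sketch, assuming predictions have already been collected from the model; the helper names and the toy labels are hypothetical:

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def accuracy_delta(few_shot_preds, zero_shot_preds, gold):
    """Few-shot accuracy minus zero-shot accuracy on the same examples."""
    return accuracy(few_shot_preds, gold) - accuracy(zero_shot_preds, gold)


# Toy data for illustration only, not taken from the paper.
gold = ["A", "B", "A", "B"]
zero_shot = ["A", "B", "B", "B"]
few_shot = ["A", "B", "A", "B"]
print(accuracy_delta(few_shot, zero_shot, gold))  # 0.25
```

Scoring both settings on the identical set of examples keeps the delta attributable to the prompt rather than to the sample.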
Risks & Boundaries
Limitations
Possible training-data leakage: LLMs' pretraining sources are unknown, so high accuracy may reflect prior exposure to evaluation data.
Evaluation uses small sampled subsets (CASI, EBM‑NLP samples), limiting generality.
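To see why small samples limit generality, it can help to put a confidence interval around a reported accuracy. The sketch below uses the Wilson score interval, a standard choice for proportions. The sample sizes are hypothetical, since this summary does not state how many examples were evaluated.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for an accuracy measured on n examples.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical sample sizes; the actual evaluation sizes are not given here.
for n in (50, 100, 1000):
    low, high = wilson_interval(0.96, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as n grows, so the same 0.96 accuracy is far less certain on a few dozen examples than on a thousand.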
When Not To Use
For high-stakes clinical decisions without human review
When labeled data and finetuning are available and affordable
Failure Modes
Inconsistent outputs due to LLM randomness
Hallucinated or incorrect extractions in noisy notes

