Overview
LLMs like GPT-4 show reliable reasoning on question-answering tasks, but extraction and classification still favor fine-tuned domain models; cost and hallucination risk reduce readiness for unsupervised deployment.
Citations: 41
Evidence Strength: 0.85
Confidence: 0.87
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
If you need high-accuracy extraction or classification in biomedical text, fine-tuned domain models remain the practical choice; use GPT-4 for reasoning or prototyping high-level QA but budget for much higher inference costs and add output validation.
Summary TLDR
This paper runs a head-to-head evaluation of four representative LLMs (GPT-3.5, GPT-4, LLaMA 2 13B, PMC-LLaMA 13B) on 12 biomedical NLP benchmarks across six task families (NER, relation extraction, multi-label classification, question answering, summarization, simplification). Key results: traditional fine-tuned biomedical models remain best for extraction and classification (macro-average ~0.65 vs. best LLM zero/few-shot ~0.51). GPT-4 excels at reasoning-style QA (e.g., MedQA ~0.72 accuracy vs. SOTA ~0.42) but is 60–100x more expensive than GPT-3.5. Open-source LLaMA variants need fine-tuning to reach competitive performance. Manual review found frequent missing, inconsistent, and hallucinated outputs.
Problem Statement
Can general-purpose LLMs replace or complement fine-tuned biomedical models across common BioNLP tasks? The paper tests zero-shot, few-shot, dynamic K-nearest few-shot, and fine-tuning settings across 12 benchmarks and inspects error types (missing output, inconsistent formats, hallucinations) and cost trade-offs.
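To make the "dynamic K-nearest few-shot" setting concrete, here is a minimal sketch of the idea: for each query, pick the K most similar labeled examples and use them as the in-context shots. The bag-of-words cosine similarity, the `pool` examples, and the prompt template are all assumptions for illustration; the summary does not say which similarity measure or prompt format the paper used.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_k_nearest(query, pool, k):
    """Return the k labeled examples from `pool` most similar to `query`."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex["text"])), reverse=True)
    return ranked[:k]

def build_prompt(query, examples):
    """Format the selected examples as shots, followed by the unlabeled query."""
    shots = "\n\n".join(f"Text: {ex['text']}\nLabel: {ex['label']}" for ex in examples)
    return f"{shots}\n\nText: {query}\nLabel:"

# Hypothetical labeled pool and query, invented for this sketch.
pool = [
    {"text": "aspirin reduces risk of myocardial infarction", "label": "treatment"},
    {"text": "BRCA1 mutation associated with breast cancer", "label": "genetics"},
    {"text": "metformin lowers blood glucose in diabetes", "label": "treatment"},
]
query = "statins reduce risk of stroke"
nearest = select_k_nearest(query, pool, k=1)
prompt = build_prompt(query, nearest)
```

A static few-shot baseline would reuse the same examples for every query; the only change here is that the shots are re-selected per query.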
Main Contribution
Broad empirical benchmark of GPT-3.5, GPT-4, LLaMA 2 13B, and PMC-LLaMA 13B on 12 biomedical datasets covering six task types under zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning settings.
Large-scale qualitative audit of hundreds of thousands of raw LLM outputs to categorize missing outputs, inconsistent formatting, and hallucinations; manual scoring of summarization outputs for accuracy, completeness, and readability.
Key Findings
Fine-tuned, domain-specific models still outperform zero- and few-shot LLMs on most BioNLP tasks.
GPT-4 strongly outperforms other models on reasoning-heavy medical question answering.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Macro-average (12 datasets) | SOTA fine-tuned 0.6536; GPT-4 zero-shot 0.4561; best LLM few-shot ~0.4750 | SOTA fine-tuned | SOTA ~0.15 higher than best zero/few-shot LLM | Aggregate across 12 benchmarks | Table 3 macro-average comparison | Table 3 |
| Accuracy | GPT-4 0.7156; GPT-3.5 0.4988; SOTA fine-tuned 0.4195 | SOTA fine-tuned | GPT-4 +0.296 vs SOTA | MedQA (5-option) | Table 3 MedQA row | Table 3 |
What To Try In 7 Days
Run GPT-4 zero-shot on a small QA sample to gauge reasoning gains vs. your baseline.
Test dynamic K-nearest few-shot for document-level classification and compare to static one-shot.
Fine-tune a PubMedBERT/BioBERT model on a small labeled extraction task and measure lift versus LLM zero-shot outputs with manual spot checks for hallucinations.
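The "measure lift" step in the last item can be sketched as an exact-match entity F1 comparison. This is one plausible scoring choice rather than the paper's evaluation script, and the gold and predicted entity sets below are toy data invented for illustration.

```python
def entity_f1(gold, pred):
    """Micro-averaged exact-match F1 over per-document entity sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data: one set of entity strings per document (hypothetical).
gold = [{"aspirin", "ibuprofen"}, {"metformin"}]
fine_tuned_pred = [{"aspirin", "ibuprofen"}, {"metformin", "insulin"}]
llm_pred = [{"aspirin"}, set()]  # the empty set models a missing output

f1_fine_tuned = entity_f1(gold, fine_tuned_pred)
f1_llm = entity_f1(gold, llm_pred)
lift = f1_fine_tuned - f1_llm
```

Missing LLM outputs are scored as empty predictions here, so they lower recall instead of being silently dropped from the comparison.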
Risks & Boundaries
Limitations
Benchmarks and metrics are biased toward supervised tasks, so they may underrate LLM benefits on underrepresented reasoning tasks.
Open-source LLaMA variants evaluated at 13B only; larger or newer open models may differ.
When Not To Use
Do not rely on zero/few-shot LLM outputs for high-stakes extraction tasks without validation.
Avoid GPT-4 for large-scale batch inference unless its accuracy gain justifies the much higher cost.
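A rough way to decide whether the accuracy gain justifies the expense is to divide the extra spend by the extra correct answers. The sketch below uses the 60–100x cost ratio and the MedQA accuracies quoted in this summary; the workload size and the GPT-3.5 bill are hypothetical, and applying MedQA accuracies to another workload is itself an assumption.

```python
# MedQA accuracies as reported in the Results table of this summary.
GPT4_ACC, GPT35_ACC = 0.7156, 0.4988

def gpt4_cost_range(gpt35_cost, low_ratio=60, high_ratio=100):
    """Scale a GPT-3.5 bill by the 60-100x ratio quoted in the summary."""
    return gpt35_cost * low_ratio, gpt35_cost * high_ratio

def cost_per_extra_correct(gpt35_cost, gpt4_cost, n_items):
    """Extra spend divided by the extra correct answers GPT-4 would deliver."""
    extra_correct = (GPT4_ACC - GPT35_ACC) * n_items
    return (gpt4_cost - gpt35_cost) / extra_correct

# Hypothetical workload: 10,000 questions costing 50 currency units on GPT-3.5.
low, high = gpt4_cost_range(50.0)
per_answer_low = cost_per_extra_correct(50.0, low, 10_000)
per_answer_high = cost_per_extra_correct(50.0, high, 10_000)
```

If the value of one additional correct answer is below `per_answer_low`, the cheaper model is likely the better fit for batch use.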
Failure Modes
Missing output (no answer returned)
Inconsistent output formatting that breaks automatic parsers
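These failure modes can be caught with a small validation step before outputs reach downstream code. The sketch below assumes the model was asked to reply with JSON of the form `{"labels": [...]}`; that response format, the `ALLOWED` label set, and the status names are hypothetical choices, not something the paper prescribes.

```python
import json

ALLOWED = {"treatment", "diagnosis", "prevention"}  # hypothetical label set

def check_output(raw):
    """Classify one raw LLM response as 'missing', 'bad_format', 'unknown_label', or 'ok'."""
    # Missing output: nothing returned, or only whitespace.
    if raw is None or not raw.strip():
        return "missing"
    # Inconsistent formatting: not valid JSON, or not the expected shape.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "bad_format"
    if not isinstance(parsed, dict) or not isinstance(parsed.get("labels"), list):
        return "bad_format"
    # Labels outside the allowed set are flagged for manual review.
    if any(not isinstance(label, str) or label not in ALLOWED for label in parsed["labels"]):
        return "unknown_label"
    return "ok"

def summarize(raw_outputs):
    """Count how many responses fall into each status."""
    counts = {}
    for raw in raw_outputs:
        status = check_output(raw)
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Running `summarize` over a sample of raw responses gives a quick failure-rate breakdown; anything not marked `ok` can be retried or routed to manual review.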

