Systematic benchmark: GPT-series and LLaMA variants vs. fine-tuned BioNLP models across 12 biomedical tasks

May 10, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

LLMs like GPT-4 show reliable reasoning on question-answering tasks, but extraction and classification still favor fine-tuned domain models; cost and hallucination risk reduce readiness for unsupervised deployment.

Citations: 41

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vipina Kuttichi Keloth, Kalpana Raja, Jiming Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, Hua Xu

Why It Matters For Business

If you need high-accuracy extraction or classification in biomedical text, fine-tuned domain models remain the practical choice; use GPT-4 for reasoning or prototyping high-level QA but budget for much higher inference costs and add output validation.

Summary TLDR

This paper runs a head-to-head evaluation of four representative LLMs (GPT-3.5, GPT-4, LLaMA 2 13B, PMC-LLaMA 13B) on 12 biomedical NLP benchmarks across six task families (NER, relation extraction, multi-label classification, question answering, summarization, simplification). Key results: traditional fine-tuned biomedical models remain best for extraction and classification (macro-average ~0.65 vs. best LLM zero/few-shot ~0.51). GPT-4 excels at reasoning-style QA (e.g., MedQA ~0.72 accuracy vs. SOTA ~0.42) but is 60–100x more expensive than GPT-3.5. Open-source LLaMA variants need fine-tuning to reach competitive performance. Manual review found frequent missing, inconsistent, and hallucinated outputs.

Problem Statement

Can general-purpose LLMs replace or complement fine-tuned biomedical models across common BioNLP tasks? The paper tests zero-shot, few-shot, dynamic K-nearest few-shot, and fine-tuning settings across 12 benchmarks and inspects error types (missing output, inconsistent formats, hallucinations) and cost trade-offs.
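The dynamic K-nearest few-shot setting can be illustrated with a minimal sketch: for each test input, retrieve the K most similar labeled examples and place them in the prompt. This summary does not say how the paper measures similarity, so the bag-of-words cosine below is an assumption standing in for a real sentence embedding, and `pool`, `select_k_nearest`, and `build_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_k_nearest(query, labeled_pool, k=2):
    """Return the k labeled examples most similar to the query, most similar first."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        labeled_pool,
        key=lambda ex: cosine(query_vec, bag_of_words(ex["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, shots):
    """Assemble a few-shot prompt from the retrieved examples."""
    blocks = [f"Text: {ex['text']}\nLabel: {ex['label']}" for ex in shots]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical labeled pool; real use would draw from a benchmark's training split.
pool = [
    {"text": "Aspirin reduced fever in treated patients.", "label": "treatment"},
    {"text": "The virus spreads through respiratory droplets.", "label": "transmission"},
    {"text": "Remdesivir shortened recovery time in patients.", "label": "treatment"},
]
query = "Dexamethasone lowered mortality in ventilated patients."
shots = select_k_nearest(query, pool, k=2)
prompt = build_prompt(query, shots)
```

A static few-shot baseline would reuse the same fixed examples for every query; the only change here is that `shots` is recomputed per input.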

Main Contribution

Broad empirical benchmark of GPT-3.5, GPT-4, LLaMA 2 13B, and PMC-LLaMA 13B on 12 biomedical datasets covering six task types under zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning settings.

Large-scale qualitative audit of hundreds of thousands of raw LLM outputs to categorize missing outputs, inconsistent formatting, and hallucinations; manual scoring of summarization outputs for accuracy, completeness, and readability.

Key Findings

Fine-tuned, domain-specific models still outperform zero- and few-shot LLMs on most BioNLP tasks.

Numbers: Macro-average: SOTA fine-tuned 0.6536 vs. best LLM zero/few-shot ~0.51

Practical Use: For extraction and classification pipelines, prioritize fine-tuning PubMedBERT/BioBERT/BART variants when labeled data is available; reserve LLM zero/few-shot for tasks where retraining is infeasible.

Evidence Ref: Table 3, main text (macro-average comparison)

GPT-4 strongly outperforms other models on reasoning-heavy medical question answering.

Numbers: MedQA accuracy: GPT-4 0.7156 vs. SOTA fine-tuned 0.4195 (≈+0.30 abs)

Practical Use: Use GPT-4 (zero/few-shot) for QA-style tasks or prototyping complex reasoning pipelines; validate outputs before clinical use.

Evidence Ref: Table 3, MedQA results

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Macro-average (12 datasets) | SOTA fine-tuned 0.6536; GPT-4 zero-shot 0.4561; best LLM few-shot ~0.4750 | SOTA fine-tuned | SOTA ~0.15 higher than best zero/few-shot LLM | Aggregate across 12 benchmarks | Table 3 macro-average comparison | Table 3 |
| Accuracy | GPT-4 0.7156; GPT-3.5 0.4988; SOTA fine-tuned 0.4195 | SOTA fine-tuned | GPT-4 +0.296 vs SOTA | MedQA (5-option) | Table 3 MedQA row | Table 3 |
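The deltas are plain absolute differences against the SOTA fine-tuned baseline. As a quick check, the sketch below recomputes them from scores quoted in this summary; the helper name `delta_vs_baseline` is hypothetical.

```python
# Scores as quoted in this summary (attributed to Table 3 of the paper).
medqa = {"GPT-4": 0.7156, "GPT-3.5": 0.4988, "SOTA fine-tuned": 0.4195}
macro = {"SOTA fine-tuned": 0.6536, "GPT-4 zero-shot": 0.4561}


def delta_vs_baseline(scores, baseline="SOTA fine-tuned"):
    """Absolute difference of every system against the named baseline."""
    base = scores[baseline]
    return {
        name: round(value - base, 4)
        for name, value in scores.items()
        if name != baseline
    }


medqa_delta = delta_vs_baseline(medqa)  # GPT-4: about +0.296, GPT-3.5: about +0.079
macro_delta = delta_vs_baseline(macro)  # GPT-4 zero-shot: about -0.198
```

Positive values mean the LLM beats the fine-tuned baseline (MedQA); negative values mean it trails (macro-average).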

What To Try In 7 Days

Run GPT-4 zero-shot on a small QA sample to gauge reasoning gains vs. your baseline.

Test dynamic K-nearest few-shot for document-level classification and compare to static one-shot.

Fine-tune a PubMedBERT/BioBERT model on a small labeled extraction task and measure lift versus LLM zero-shot outputs with manual spot checks for hallucinations.
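For the manual spot checks, one cheap first pass is to flag extracted entities that never appear in the source text, since a fabricated mention cannot be a valid span. This is a minimal sketch under that assumption, not the paper's audit procedure; `flag_unsupported_entities` is a hypothetical helper, and the example sentence is made up for illustration.

```python
def flag_unsupported_entities(source_text, predicted_entities):
    """Split predictions into those found verbatim in the source and those that are not."""
    haystack = source_text.lower()
    supported, unsupported = [], []
    for entity in predicted_entities:
        mention = entity.strip()
        # Case-insensitive substring match; empty predictions count as unsupported.
        if mention and mention.lower() in haystack:
            supported.append(mention)
        else:
            unsupported.append(mention)
    return supported, unsupported


# Illustrative input, not drawn from any benchmark.
source = "Naloxone reverses the antihypertensive effect of clonidine."
predicted = ["naloxone", "clonidine", "propranolol"]
supported, unsupported = flag_unsupported_entities(source, predicted)
```

The check only catches mentions absent from the input; wrong entity types or boundaries still need human review, so it is best used to prioritize which outputs to inspect first.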

Optimization Features

Training Optimization
LoRA
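LoRA freezes a weight matrix of shape d x k and trains two low-rank factors of shapes d x r and r x k instead, so the trainable parameter count per adapted matrix drops from d*k to r*(d + k). The sketch below works through that arithmetic; the dimensions and rank are illustrative assumptions, since this summary does not state the configuration used in the paper.

```python
def lora_trainable_params(d, k, r):
    """Parameters in the two low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


def full_params(d, k):
    """Parameters in the original dense weight matrix (d x k)."""
    return d * k


# Illustrative values only: a square 5120 x 5120 projection with rank 8.
d, k, r = 5120, 5120, 8
lora_count = lora_trainable_params(d, k, r)
dense_count = full_params(d, k)
fraction_trained = lora_count / dense_count
```

With these assumed values, the adapter trains well under 1% of the parameters of the matrix it modifies.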

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmarks and metrics are biased toward supervised tasks; may underrate LLM benefits in underrepresented reasoning tasks.

Open-source LLaMA variants evaluated at 13B only; larger or newer open models may differ.

When Not To Use

Do not rely on zero/few-shot LLM outputs for high-stakes extraction tasks without validation.

Avoid GPT-4 for large-scale batch inference unless its accuracy gain justifies a cost 60–100x higher than GPT-3.5.
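To make the cost boundary concrete, the 60–100x multiplier quoted in this summary can be applied to a measured GPT-3.5 cost per document. In the sketch below, the per-document cost and corpus size are placeholder assumptions, not real pricing, and `gpt4_cost_range` is a hypothetical helper.

```python
def gpt4_cost_range(baseline_cost_per_doc, num_docs, low_multiplier=60, high_multiplier=100):
    """Return (GPT-3.5 total, GPT-4 low estimate, GPT-4 high estimate) for a batch job."""
    baseline_total = baseline_cost_per_doc * num_docs
    return (
        baseline_total,
        baseline_total * low_multiplier,
        baseline_total * high_multiplier,
    )


# Placeholder inputs: substitute your own measured cost per document and corpus size.
baseline, low, high = gpt4_cost_range(baseline_cost_per_doc=0.001, num_docs=1_000_000)
```

Comparing `low` and `high` against the budget before a batch run gives a rough go/no-go signal.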

Failure Modes

Missing output (no answer returned)

Inconsistent output formatting that breaks automatic parsers
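Both failure modes can be detected before scoring. The sketch below is one possible validator for a 5-option multiple-choice task such as MedQA: it labels each raw output as missing, inconsistent, or ok, and salvages an answer letter when exactly one is present. The function name and status labels are assumptions for illustration, not the paper's audit categories.

```python
import re
from collections import Counter

VALID_OPTIONS = {"A", "B", "C", "D", "E"}  # MedQA is described above as 5-option


def classify_output(raw):
    """Return (status, answer) where status is 'missing', 'inconsistent', or 'ok'."""
    text = (raw or "").strip()
    if not text:
        return "missing", None
    if text.upper() in VALID_OPTIONS:
        return "ok", text.upper()
    # Salvage answers wrapped in extra words, e.g. "The answer is (C)."
    # Caveat: a standalone capital "A" used as an article would also match.
    found = set(re.findall(r"\b([A-E])\b", text))
    if len(found) == 1:
        return "inconsistent", found.pop()
    return "inconsistent", None


raw_outputs = ["C", "", "The answer is (C).", "I cannot determine.", None]
results = [classify_output(raw) for raw in raw_outputs]
status_counts = Counter(status for status, _ in results)
```

Tracking `status_counts` per model and prompt setting shows how much of an accuracy gap comes from parsing failures rather than wrong answers.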

Core Entities

Models

GPT-3.5 (gpt-3.5-turbo-16k-0613), GPT-4 (gpt-4-0613, gpt-4-32k-0613), LLaMA 2 13B, PMC-LLaMA 13B, BioBERT, PubMedBERT, BART

Metrics

Macro F1, Micro F1, Entity-level F1, Accuracy, ROUGE-L, BERTScore, BARTScore, Readability (FKGL, DCRS)
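Macro and micro F1 can diverge sharply on multi-label tasks: macro F1 averages per-label scores, so rare labels weigh as much as common ones, while micro F1 pools counts across labels. The sketch below shows the standard definitions on made-up data; the labels and predictions are illustrative, not benchmark results.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred, labels):
    """gold and pred are parallel lists of label sets, one set per document."""
    per_label = {}
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label[label] = f1(tp, fp, fn)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_label.values()) / len(labels)
    micro = f1(total_tp, total_fp, total_fn)
    return macro, micro


# Made-up example: a model that always predicts the most common label.
gold = [{"treatment"}, {"treatment"}, {"treatment", "diagnosis"}, {"prevention"}]
pred = [{"treatment"}, {"treatment"}, {"treatment"}, {"treatment"}]
macro, micro = macro_micro_f1(gold, pred, ["treatment", "diagnosis", "prevention"])
```

Here micro F1 is about 0.67 while macro F1 is about 0.29, because the two labels the model never predicts each contribute a score of zero to the macro average.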

Datasets

BC5CDR-chemical, NCBI-disease, ChemProt, DDI2013, HoC, LitCovid, MedQA, PubMedQA, PubMed Text Summarization, MS^2, Cochrane PLS, PLOS Text Simplification

Benchmarks

12-dataset BioNLP benchmark (6 task families)

Context Entities

Models

BioGPT, BioMedLM, PMC-LLaMA 7B/13B, Meditron

Metrics

Accuracy

Datasets

PubMed, PMC articles, USMLE-style question sets

Benchmarks

MedQA, PubMedQA (as reasoning-focused datasets)