Systematic benchmark: GPT-series and LLaMA variants vs. fine-tuned BioNLP models across 12 biomedical tasks

Overview

Decision Snapshot: Ready For Pilot

LLMs like GPT-4 show reliable reasoning on question-answering tasks, but extraction and classification still favor fine-tuned domain models; cost and hallucination risk reduce readiness for unsupervised deployment.

Citations: 41

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vipina Kuttichi Keloth, Kalpana Raja, Jiming Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, Hua Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need high-accuracy extraction or classification in biomedical text, fine-tuned domain models remain the practical choice; use GPT-4 for reasoning or prototyping high-level QA but budget for much higher inference costs and add output validation.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper runs a head-to-head evaluation of four representative LLMs (GPT-3.5, GPT-4, LLaMA 2 13B, PMC-LLaMA 13B) on 12 biomedical NLP benchmarks across six task families (NER, relation extraction, multi-label classification, question answering, summarization, simplification). Key results: traditional fine-tuned biomedical models remain best for extraction and classification (macro-average ~0.65 vs. best LLM zero/few-shot ~0.51). GPT-4 excels at reasoning-style QA (e.g., MedQA ~0.72 accuracy vs. SOTA ~0.42) but is 60–100x more expensive than GPT-3.5. Open-source LLaMA variants need fine-tuning to reach competitive performance. Manual review found frequent missing, inconsistent, and hallucinated outputs.

Problem Statement

Can general-purpose LLMs replace or complement fine-tuned biomedical models across common BioNLP tasks? The paper tests zero-shot, few-shot, dynamic K-nearest few-shot, and fine-tuning settings across 12 benchmarks and inspects error types (missing output, inconsistent formats, hallucinations) and cost trade-offs.
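
The dynamic K-nearest few-shot setting named above can be illustrated with a minimal sketch: for each test input, retrieve the K most similar labeled training examples and place them in the prompt as demonstrations. The similarity function here (bag-of-words cosine), the prompt layout, and the toy `TRAIN_POOL` are all assumptions made for illustration; the summary does not say which encoder or template the paper used.

```python
import math
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_k_nearest(query, train_pool, k=2):
    """Return the k labeled examples most similar to the query."""
    q = bow_vector(query)
    ranked = sorted(
        train_pool, key=lambda ex: cosine(q, bow_vector(ex["text"])), reverse=True
    )
    return ranked[:k]


def build_prompt(query, shots, instruction):
    """Join the instruction, the retrieved shots, and the query into one prompt."""
    parts = [instruction]
    for ex in shots:
        parts.append(f"Text: {ex['text']}\nLabel: {ex['label']}")
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical toy pool; a real run would draw from a benchmark's training split.
TRAIN_POOL = [
    {"text": "aspirin reduces the risk of stroke", "label": "treatment"},
    {"text": "the spike protein binds the ACE2 receptor", "label": "mechanism"},
    {"text": "remdesivir shortens recovery time in hospitalized patients", "label": "treatment"},
]
```

Static few-shot would pass the same fixed `shots` for every query; the dynamic variant recomputes them per query, which is the only difference this sketch is meant to show.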

Main Contribution

Broad empirical benchmark of GPT-3.5, GPT-4, LLaMA 2 13B, and PMC-LLaMA 13B on 12 biomedical datasets covering six task types under zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning settings.

Large-scale qualitative audit of hundreds of thousands of raw LLM outputs to categorize missing outputs, inconsistent formatting, and hallucinations; manual scoring of summarization outputs for accuracy, completeness, and readability.

Key Findings

Fine-tuned, domain-specific models still outperform zero- and few-shot LLMs on most BioNLP tasks.

Numbers: Macro-average: SOTA fine-tuned 0.6536 vs. best LLM zero/few-shot ~0.51

Practical Use: For extraction and classification pipelines, prioritize fine-tuning PubMedBERT/BioBERT/BART variants when labeled data is available; reserve LLM zero/few-shot for tasks where retraining is infeasible.

Evidence Ref: Table 3, main text (macro-average comparison)

GPT-4 strongly outperforms other models on reasoning-heavy medical question answering.

Numbers: MedQA accuracy: GPT-4 0.7156 vs. SOTA fine-tuned 0.4195 (≈+0.30 abs)

Practical Use: Use GPT-4 (zero/few-shot) for QA-style tasks or prototyping complex reasoning pipelines; validate outputs before clinical use.

Evidence Ref: Table 3, MedQA results

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Macro-average (12 datasets)	SOTA fine-tuned 0.6536; GPT-4 zero-shot 0.4561; best LLM few-shot ~0.4750	SOTA fine-tuned	SOTA ~0.15 higher than best zero/few-shot LLM	Aggregate across 12 benchmarks	Table 3 macro-average comparison	Table 3
Accuracy	GPT-4 0.7156; GPT-3.5 0.4988; SOTA fine-tuned 0.4195	SOTA fine-tuned	GPT-4 +0.296 vs SOTA	MedQA (5-option)	Table 3 MedQA row	Table 3
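
The Delta column for the MedQA row can be reproduced directly from the reported accuracies. The snippet below is a small sketch of that arithmetic; the dictionary layout and the function name `deltas_vs_baseline` are illustrative choices, and the numbers are copied from the row above.

```python
# MedQA accuracies as reported in the Results table above.
MEDQA_ACCURACY = {
    "GPT-4": 0.7156,
    "GPT-3.5": 0.4988,
    "SOTA fine-tuned": 0.4195,
}


def deltas_vs_baseline(scores, baseline="SOTA fine-tuned"):
    """Absolute difference of every system against the named baseline."""
    base = scores[baseline]
    return {
        name: round(value - base, 4)
        for name, value in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, delta in deltas_vs_baseline(MEDQA_ACCURACY).items():
        print(f"{name}: {delta:+.4f} vs. SOTA fine-tuned")
```

These are absolute accuracy differences, not relative improvements.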

What To Try In 7 Days

Run GPT-4 zero-shot on a small QA sample to gauge reasoning gains vs. your baseline.

Test dynamic K-nearest few-shot for document-level classification and compare to static one-shot.

Fine-tune a PubMedBERT/BioBERT model on a small labeled extraction task and measure lift versus LLM zero-shot outputs with manual spot checks for hallucinations.

Optimization Features

Training Optimization

LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://doi.org/10.5281/zenodo.14025500

Data URLs

https://doi.org/10.5281/zenodo.14025500

Risks & Boundaries

Limitations

Benchmarks and metrics are biased toward supervised tasks; may underrate LLM benefits in underrepresented reasoning tasks.

Open-source LLaMA variants evaluated at 13B only; larger or newer open models may differ.

When Not To Use

Do not rely on zero/few-shot LLM outputs for high-stakes extraction tasks without validation.

Avoid GPT-4 for large-scale batch inference when cost is prohibitive unless its accuracy gain justifies expense.

Failure Modes

Missing output (no answer returned)

Inconsistent output formatting that breaks automatic parsers
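
Both failure modes above can be caught by a thin validation layer before model output reaches downstream code. The sketch below assumes a five-option multiple-choice task in the style of MedQA and an answer format of a single option letter; the accepted pattern and the status names are assumptions for illustration, not the paper's audit procedure.

```python
import re

# Accept a lone option letter, optionally wrapped like "(B)" or followed by "." or ":".
ANSWER_PATTERN = re.compile(r"\s*\(?([A-Ea-e])\)?[.:]?\s*")


def check_output(raw):
    """Classify a raw response as 'missing', 'inconsistent_format', or 'ok'.

    Returns a (status, answer) pair; answer is None unless status is 'ok'.
    """
    if raw is None or not raw.strip():
        return "missing", None
    match = ANSWER_PATTERN.fullmatch(raw)
    if match:
        return "ok", match.group(1).upper()
    return "inconsistent_format", None


def audit(responses):
    """Count how many responses fall into each status."""
    counts = {"missing": 0, "inconsistent_format": 0, "ok": 0}
    for raw in responses:
        status, _ = check_output(raw)
        counts[status] += 1
    return counts
```

A format check like this does not detect hallucinated content; a well-formed answer can still be wrong, so manual spot checks remain necessary.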

Core Entities

Models

GPT-3.5 (gpt-3.5-turbo-16k-0613), GPT-4 (gpt-4-0613, gpt-4-32k-0613), LLaMA 2 13B, PMC-LLaMA 13B, BioBERT, PubMedBERT, BART

Metrics

Macro F1, Micro F1, Entity-level F1, Accuracy, ROUGE-L, BERTScore, BARTScore, Readability (FKGL, DCRS)
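
Macro F1 and micro F1 can diverge sharply on imbalanced label sets, which matters when reading the classification results. The sketch below computes both for the single-label case in plain Python; it is a simplified illustration (the multi-label benchmarks would need per-document label sets), and the function names are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred):
    """Return (macro_f1, micro_f1) for equal-length lists of single labels."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    gold = ["a", "a", "a", "b"]
    pred = ["a", "a", "a", "a"]  # the rare label "b" is never predicted
    print(macro_micro_f1(gold, pred))
```

In this toy case the rare label is missed entirely, so the macro score falls well below the micro score.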

Datasets

BC5CDR-chemical, NCBI-disease, ChemProt, DDI2013, HoC, LitCovid, MedQA, PubMedQA, PubMed Text Summarization, MS^2, Cochrane PLS, PLOS Text Simplification

Benchmarks

12-dataset BioNLP benchmark (6 task families)

Context Entities

Models

BioGPT, BioMedLM, PMC-LLaMA 7B/13B, Meditron

Metrics

Accuracy

Datasets

PubMed, PMC articles, USMLE-style question sets

Benchmarks

MedQA, PubMedQA (as reasoning-focused datasets)
