Overview
Production Readiness
0.3
Novelty Score
0.3
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
This paper shows that large open models can give much better zero-shot medical answers than smaller ones, and that parameter-efficient architectures can approach top quality. For product teams, that means you can trade some accuracy for lower compute cost and still get usable clinical answers, provided you pick the right model and add quality checks.
Summary TLDR
The authors ran a zero-shot benchmark of five LLMs on 3,000 iCliniq medical Q&A pairs. Llama-3.3-70B led on BLEU/ROUGE and on an LLM-as-a-judge clinical score. Llama-4-Maverick-17B delivered near-best quality with far fewer parameters. GPT-5-mini performed poorly on lexical metrics but scored reasonably on safety. The study argues for combining n-gram metrics with LLM judges to capture clinical quality.
Problem Statement
Medical QA needs reliable, clinically safe models. Existing evaluations rely on n-gram metrics that miss factual and safety aspects. Practitioners need a clearer, practical ranking of recent LLMs in real medical Q&A without task-specific fine-tuning.
Main Contribution
Zero-shot benchmark of five contemporary LLMs (Llama variants and GPT-5-mini) on a 3,000-sample subset of the 38k iCliniq medical QA dataset using BLEU and ROUGE.
A dual evaluation: standard automatic metrics plus an LLM-as-a-judge (Claude Sonnet 4) scoring medical accuracy, safety, completeness, clarity, and helpfulness with a weighted rubric.
Empirical scaling and efficiency analysis showing Llama-3.3-70B as top performer and Llama-4-Maverick-17B as a parameter-efficient alternative suitable for constrained deployments.
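The weighted rubric in the dual evaluation can be illustrated with a minimal sketch. The five dimension names come from the summary above. The weights, the 1-5 score scale, and the function name `weighted_judge_score` are assumptions made for illustration, since the summary does not state the values the authors used.

```python
# Hypothetical weights: the summary names the five dimensions but not their weights.
ASSUMED_WEIGHTS = {
    "medical_accuracy": 0.30,
    "safety": 0.25,
    "completeness": 0.20,
    "clarity": 0.15,
    "helpfulness": 0.10,
}


def weighted_judge_score(scores, weights=ASSUMED_WEIGHTS):
    """Combine per-dimension judge scores (assumed 1-5 scale) into one overall score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalise by the weight sum so the result stays on the input scale.
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


example = {
    "medical_accuracy": 5,
    "safety": 4,
    "completeness": 3,
    "clarity": 4,
    "helpfulness": 4,
}
overall = weighted_judge_score(example)
```

Dividing by the weight sum keeps the overall score on the same scale as the inputs even if the weights are later changed and no longer sum to one.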
Key Findings
Llama-3.3-70B achieved the highest automatic and judge scores.
Llama-4-Maverick-17B matches most of the top model's performance with fewer parameters.
GPT-5-mini scored poorly on lexical overlap but showed decent safety ratings.
Automatic n-gram metrics correlate with judge scores but miss clinical dimensions.
Performance varies substantially by question type and specialty.
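To check on your own data whether n-gram metrics track judge scores, one option is a Pearson correlation over per-response scores. The sketch below is an illustration and not the authors' analysis. The function name `pearson` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-response scores, for illustration only.
rouge_l_scores = [0.10, 0.15, 0.22, 0.30]
judge_scores = [3.1, 3.4, 3.9, 4.2]
r = pearson(rouge_l_scores, judge_scores)
```

A high correlation would only indicate that the two signals move together. It would not show that the lexical metric captures safety or factual accuracy.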
Results
BLEU-1
ROUGE-1
ROUGE-L
LLM-as-a-Judge overall score
High-quality response rate (judge ≥4.0)
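The high-quality response rate can be computed from overall judge scores as the share of responses at or above the 4.0 threshold. This is a minimal sketch. The function name `high_quality_rate` and the sample scores are hypothetical.

```python
def high_quality_rate(overall_scores, threshold=4.0):
    """Fraction of responses whose overall judge score is >= threshold."""
    if not overall_scores:
        raise ValueError("no scores given")
    hits = sum(1 for score in overall_scores if score >= threshold)
    return hits / len(overall_scores)


# Made-up overall judge scores, for illustration only.
rate = high_quality_rate([4.0, 3.9, 4.5, 2.0])
```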
Who Should Care
What To Try In 7 Days
Run a 200–300 sample zero-shot test on your target clinical topics using Llama-4-Maverick-17B and Llama-3.3-70B to compare cost vs. quality.
Add an LLM-as-a-judge step (use a vetted judge model) to your evaluation pipeline to catch factual and safety gaps not visible to BLEU/ROUGE.
Conduct spot checks in high-risk specialties (cardiology, pediatrics) to measure variability and decide where human review is mandatory.
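For a quick pilot, the lexical side of the comparison can be sketched without external libraries. The code below is a simplified, single-reference BLEU-1 (clipped unigram precision with a brevity penalty) and a ROUGE-1 F1, using whitespace tokenisation. These simplifications are assumptions, and the scores will differ from those of standard toolkits, which tokenise and smooth differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def unigram_overlap(cand_tokens, ref_tokens):
    # Clipped overlap: each reference token can be matched at most once.
    return sum((Counter(cand_tokens) & Counter(ref_tokens)).values())


def bleu1(candidate, reference):
    """Single-reference BLEU-1: clipped unigram precision times brevity penalty."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    precision = unigram_overlap(cand, ref) / len(cand)
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * precision


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    overlap = unigram_overlap(cand, ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up answer pair, for illustration only.
reference_answer = "take rest and fluids"
model_answer = "rest and drink fluids"
scores = (bleu1(model_answer, reference_answer), rouge1_f1(model_answer, reference_answer))
```

Running the same functions over each model's answers to the same sample gives a like-for-like lexical baseline to set beside the judge scores.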
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Zero-shot only: no task-specific fine-tuning was performed, so real-world deployed systems may require extra tuning.
- Subset sampling: evaluation used a 3,000-pair random subset from 38k, which may not capture all specialty edge cases.
- Judge bias and single-judge model: clinical judgments rely on Claude Sonnet 4; judge model bias can affect scores.
- Automatic metrics' blind spots: BLEU/ROUGE measure lexical overlap and miss factual correctness and safety without the judge step.
- Unclear GPT-5-mini setup: authors note potential evaluation or configuration issues for GPT-5-mini affecting performance.
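On the subset-sampling point, a fixed seed makes a random subset reproducible, so every model is scored on the same questions. This is a sketch under stated assumptions. The summary does not describe the authors' sampling procedure, and the function name `sample_subset` and the seed value are hypothetical.

```python
import random


def sample_subset(pairs, k=3000, seed=0):
    """Draw k items without replacement using a dedicated, seeded generator."""
    if k > len(pairs):
        raise ValueError("subset size exceeds dataset size")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(pairs, k)


# Stand-in dataset of integer IDs; real Q&A pairs would go here.
dataset = list(range(38000))
subset = sample_subset(dataset)
```

Stratifying the draw by specialty would be one way to address the edge-case concern, at the cost of no longer being a simple random sample.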
When Not To Use
- Do not use these zero-shot results as proof of clinical safety for live patient-facing systems.
- Avoid relying on lexical metrics alone to approve clinical outputs for deployment.
- Do not assume consistent performance across all medical specialties without targeted validation.
Failure Modes
- Hallucinations: models may state incorrect medical facts despite fluent language.
- Specialty variance: performance depends on question complexity and medical domain.
- Judge-model bias: LLM-as-a-judge can mis-evaluate factual accuracy or safety.
- Underreporting of calibration issues: small models might appear safe but omit critical clinical content.
Core Entities
Models
- Llama-3-8B-Instruct
- Llama-3.2-3B
- Llama-3.3-70B-Instruct
- Llama-4-Maverick-17B-128E-Instruct
- GPT-5-mini
Metrics
- BLEU-1
- BLEU-4
- ROUGE-1
- ROUGE-2
- ROUGE-L
- LLM-as-a-Judge overall and per-dimension scores
Datasets
- iCliniq medical QA (38k pairs; 3,000-pair subset sampled)

