Overview
The evaluation uses clear metrics and a judge model, but it is zero-shot only, draws on a sampled subset, and relies on a single judge model. The results are useful for benchmarking but are not sufficient on their own to justify clinical deployment.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.75
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 30%
Novelty: 30%
Why It Matters For Business
This paper shows that the largest open model tested gives the best zero-shot medical answers, while a parameter-efficient architecture comes close to that quality. For product teams, that means you can trade some accuracy for lower compute cost and still get usable clinical answers if you pick the right model and add quality checks.
Who Should Care
Summary TLDR
The authors ran a zero-shot benchmark of five LLMs on 3,000 iCliniq medical Q&A pairs. Llama-3.3-70B led on BLEU/ROUGE and on an LLM-as-a-judge clinical score. Llama-4-Maverick-17B delivered near-best quality with far fewer parameters. GPT-5-mini performed poorly in lexical metrics but scored reasonably on safety. The study argues for combining n-gram metrics with LLM judges to capture clinical quality.
Problem Statement
Medical QA needs reliable, clinically safe models. Existing evaluations rely on n-gram metrics that miss factual and safety aspects. Practitioners need a clearer, practical ranking of recent LLMs in real medical Q&A without task-specific fine-tuning.
Main Contribution
Zero-shot benchmark of five contemporary LLMs (Llama variants and GPT-5-mini) on a 3,000-sample subset of the 38k iCliniq medical QA dataset using BLEU and ROUGE.
A dual evaluation: standard automatic metrics plus an LLM-as-a-judge (Claude Sonnet 4) scoring medical accuracy, safety, completeness, clarity, and helpfulness with a weighted rubric.
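The weighted rubric can be pictured as a weighted average over the five judge criteria. The sketch below is illustrative only: the summary names the criteria but not their weights, so the values in `WEIGHTS` and the function name `weighted_judge_score` are assumptions, not the authors' rubric.

```python
# Hypothetical weights: the source lists the five criteria but not how
# they are weighted, so these numbers are placeholders for illustration.
WEIGHTS = {
    "medical_accuracy": 0.30,
    "safety": 0.25,
    "completeness": 0.20,
    "clarity": 0.15,
    "helpfulness": 0.10,
}


def weighted_judge_score(scores):
    """Combine per-criterion judge scores into a single weighted average."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    # Dividing by the total keeps the result on the same scale as the
    # inputs even if the weights do not sum exactly to 1.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight
```

With weights like these, a fluent but unsafe answer is penalized more heavily than an accurate answer that is merely terse, which is the kind of distinction n-gram metrics cannot express.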
Key Findings
Llama-3.3-70B achieved the highest automatic and judge scores.
Llama-4-Maverick-17B comes close to the top model's performance with far fewer parameters.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BLEU-1 | Llama-3.3-70B 0.2207; Llama-4-Maverick-17B 0.2089; Llama-3.2-3B 0.2012; Llama-3-8B 0.1739; GPT-5-mini 0.0124 | MedLM best 0.0998 (GPT-2) | Llama-3.3-70B ≈ +0.121 over best baseline | iCliniq 3,000-sample subset | Table II | Table II |
| ROUGE-1 | Llama-3.3-70B 0.2761; Llama-4-Maverick-17B 0.2597; Llama-3.2-3B 0.2588; Llama-3-8B 0.2419; GPT-5-mini 0.2024 | MedLM best 0.0022 (GPT-3.5) | Llama-3.3-70B ≈ +0.2739 over best baseline | iCliniq 3,000-sample subset | Table II and III | Tables II/III |
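
The Delta column is the absolute difference between the best model's score and the best MedLM baseline. A short check of that arithmetic, using only values copied from the table (the dictionary layout and the helper name `absolute_delta` are illustrative):

```python
# Scores copied from the Results table: best model vs. best MedLM baseline.
results = {
    "BLEU-1": {"best": 0.2207, "baseline": 0.0998},
    "ROUGE-1": {"best": 0.2761, "baseline": 0.0022},
}


def absolute_delta(best, baseline):
    """Absolute improvement, rounded to four decimals like the table values."""
    return round(best - baseline, 4)


deltas = {
    metric: absolute_delta(row["best"], row["baseline"])
    for metric, row in results.items()
}

for metric, delta in deltas.items():
    print(f"{metric}: +{delta:.4f}")
```

The differences come out to 0.1209 for BLEU-1 and 0.2739 for ROUGE-1, which agrees with the table's rounded deltas.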
What To Try In 7 Days
Run a 200–300 sample zero-shot test on your target clinical topics using Llama-4-Maverick-17B and Llama-3.3-70B to compare cost vs. quality.
Add an LLM-as-a-judge step (use a vetted judge model) to your evaluation pipeline to catch factual and safety gaps not visible to BLEU/ROUGE.
Conduct spot checks in high-risk specialties (cardiology, pediatrics) to measure variability and decide where human review is mandatory.
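For the small zero-shot test suggested above, a lexical score can be computed without extra dependencies. The sketch below is a minimal BLEU-1 (clipped unigram precision times a brevity penalty) against a single reference answer. It assumes lowercased whitespace tokenization, which may differ from the tokenization used in the paper, and the names `bleu1` and `mean_bleu1` are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """BLEU-1 for one candidate against one reference answer."""
    # Assumption: simple lowercased whitespace tokenization.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate token count by its count in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    return penalty * precision


def mean_bleu1(pairs):
    """Average BLEU-1 over (model_answer, reference_answer) pairs."""
    scores = [bleu1(answer, reference) for answer, reference in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `mean_bleu1` once per model on the same 200 to 300 question sample gives a like-for-like lexical comparison. It does not replace the judge step, since a high unigram overlap says nothing about factual correctness or safety.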
Risks & Boundaries
Limitations
Zero-shot only: no task-specific fine-tuning was performed, so real-world deployed systems may require extra tuning.
Subset sampling: the evaluation used a random 3,000-pair subset of the 38k-pair dataset, which may not capture all specialty edge cases.
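When replicating a subset evaluation like this one, fixing the random seed makes the sample repeatable across models and runs. The sketch below is an assumption about how such a draw could be done, not the authors' procedure; `sample_subset`, the default seed, and the integer stand-in for the dataset are all hypothetical.

```python
import random


def sample_subset(items, k=3000, seed=0):
    """Draw k items without replacement using a fixed seed."""
    if k > len(items):
        raise ValueError("k exceeds dataset size")
    # A dedicated Random instance avoids touching the global RNG state.
    rng = random.Random(seed)
    return rng.sample(items, k)


# Stand-in for the dataset: integer row ids (38,000 is an approximation of "38k").
row_ids = list(range(38000))
subset = sample_subset(row_ids)
```

A purely random draw can still under-represent rare specialties, so it is worth counting how many sampled questions fall in each specialty before drawing conclusions about any one of them.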
When Not To Use
Do not use these zero-shot results as proof of clinical safety for live patient-facing systems.
Avoid relying on lexical metrics alone to approve clinical outputs for deployment.
Failure Modes
Hallucinations: models may state incorrect medical facts despite fluent language.
Specialty variance: performance depends on question complexity and medical domain.

