Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

February 16, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The evaluation uses clear metrics and a judge model but is zero-shot only, uses a sampled subset, and relies on one judge model. Results are indicative for benchmarking but not sufficient alone for clinical deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 30%

Authors

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury

Why It Matters For Business

This paper shows that large open models give the strongest zero-shot medical answers, while parameter-efficient architectures come close to that quality. For product teams, this means you can trade a small amount of accuracy for lower compute cost and still get usable clinical answers, provided you pick the right model and add quality checks.

Summary TLDR

The authors ran a zero-shot benchmark of five LLMs on 3,000 iCliniq medical Q&A pairs. Llama-3.3-70B led on BLEU/ROUGE and on an LLM-as-a-judge clinical score. Llama-4-Maverick-17B delivered near-best quality with far fewer parameters. GPT-5-mini performed poorly in lexical metrics but scored reasonably on safety. The study argues for combining n-gram metrics with LLM judges to capture clinical quality.

Problem Statement

Medical QA needs reliable, clinically safe models. Existing evaluations rely on n-gram metrics that miss factual and safety aspects. Practitioners need a clearer, practical ranking of recent LLMs in real medical Q&A without task-specific fine-tuning.
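To make the limitation of n-gram metrics concrete, here is a minimal sketch of unigram overlap scoring in the spirit of BLEU-1 (clipped precision) and ROUGE-1 (recall). It is a simplification, not the metric implementation used in the paper: it tokenizes on whitespace and omits BLEU's brevity penalty. The function name `unigram_overlap` and both example sentences are made up for illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> tuple[float, float]:
    """Return (clipped unigram precision, unigram recall) on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Illustrative strings only: dropping one word reverses the advice.
reference = "do not take ibuprofen with warfarin"
unsafe = "do take ibuprofen with warfarin"

precision, recall = unigram_overlap(unsafe, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The answer that reverses the advice still shares five of the reference's six tokens, so overlap metrics alone cannot separate it from a safe answer.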

Main Contribution

Zero-shot benchmark of five contemporary LLMs (Llama variants and GPT-5-mini) on a 3,000-sample subset of the 38k iCliniq medical QA dataset using BLEU and ROUGE.

A dual evaluation: standard automatic metrics plus an LLM-as-a-judge (Claude Sonnet 4) scoring medical accuracy, safety, completeness, clarity, and helpfulness with a weighted rubric.
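A weighted rubric of this kind can be sketched as a weighted average over the five dimensions. The weights below are hypothetical placeholders, since this summary does not state the paper's weights, and the 1 to 5 score range is an assumption based on the reported "/5" scores. The names `ASSUMED_WEIGHTS` and `weighted_judge_score` are illustrative.

```python
DIMENSIONS = ("medical_accuracy", "safety", "completeness", "clarity", "helpfulness")

# Hypothetical weights, NOT taken from the paper; they only need to sum to 1.
ASSUMED_WEIGHTS = {
    "medical_accuracy": 0.30,
    "safety": 0.30,
    "completeness": 0.15,
    "clarity": 0.10,
    "helpfulness": 0.15,
}


def weighted_judge_score(scores, weights=ASSUMED_WEIGHTS):
    """Combine per-dimension judge scores (assumed 1 to 5) into one overall score."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    if abs(sum(weights[d] for d in DIMENSIONS) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} score out of range: {scores[d]}")
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


# Made-up judge output for a single answer.
example_scores = {
    "medical_accuracy": 5,
    "safety": 5,
    "completeness": 4,
    "clarity": 4,
    "helpfulness": 3,
}
overall = weighted_judge_score(example_scores)
print(round(overall, 2))
```

Putting more weight on accuracy and safety is one plausible design choice for clinical use; any real deployment would need to set and validate its own weights.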

Key Findings

Llama-3.3-70B achieved the highest automatic and judge scores.

Numbers: BLEU-1 0.2207; ROUGE-1 0.2761; Judge 4.40/5

Practical Use: If you need the strongest zero-shot medical QA output and have compute, prioritize large models like Llama-3.3-70B.

Evidence Ref: Tables II and IV

Llama-4-Maverick-17B matches most of the top model's performance with fewer parameters.

Numbers: ROUGE-1 0.2597 (~94% of Llama-3.3 ROUGE-1); Judge 4.23/5

Practical Use: For resource-limited deployments, Llama-4-Maverick-17B offers near-top medical QA quality at lower compute cost.

Evidence Ref: Table II; section IV.C

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BLEU-1 | Llama-3.3-70B 0.2207; Llama-4-Maverick-17B 0.2089; Llama-3.2-3B 0.2012; Llama-3-8B 0.1739; GPT-5-mini 0.0124 | MedLM best 0.0998 (GPT-2) | Llama-3.3-70B ≈ +0.121 over best baseline | iCliniq 3,000-sample subset | Table II | Table II
ROUGE-1 | Llama-3.3-70B 0.2761; Llama-4-Maverick-17B 0.2597; Llama-3.2-3B 0.2588; Llama-3-8B 0.2419; GPT-5-mini 0.2024 | MedLM best 0.0022 (GPT-3.5) | Llama-3.3-70B ≈ +0.2739 over best baseline | iCliniq 3,000-sample subset | Table II and III | Tables II/III
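The reported deltas and the "~94%" figure can be rechecked with a few lines of arithmetic. This sketch only re-derives numbers already quoted in this summary; the variable and function names are illustrative.

```python
# Scores copied from the Results rows above (Llama-3.3-70B vs. best MedLM baseline).
results = {
    "BLEU-1": {"model": 0.2207, "baseline": 0.0998},
    "ROUGE-1": {"model": 0.2761, "baseline": 0.0022},
}


def absolute_delta(model: float, baseline: float) -> float:
    """Absolute improvement of the model score over the baseline score."""
    return round(model - baseline, 4)


deltas = {
    metric: absolute_delta(row["model"], row["baseline"])
    for metric, row in results.items()
}

# Share of the top model's ROUGE-1 reached by Llama-4-Maverick-17B.
maverick_share = round(0.2597 / 0.2761, 2)

print(deltas, maverick_share)
```

These are absolute differences in metric points, not relative improvements; a relative figure would be inflated for ROUGE-1 because the baseline is close to zero.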

What To Try In 7 Days

Run a 200–300 sample zero-shot test on your target clinical topics using Llama-4-Maverick-17B and Llama-3.3-70B to compare cost vs. quality.

Add an LLM-as-a-judge step (use a vetted judge model) to your evaluation pipeline to catch factual and safety gaps not visible to BLEU/ROUGE.

Conduct spot checks in high-risk specialties (cardiology, pediatrics) to measure variability and decide where human review is mandatory.
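One way to build the pilot set for these steps is a seeded, per-specialty sample, so that both models are scored on the same questions and high-risk specialties are guaranteed coverage. This is a minimal sketch under assumptions: the record layout, the `specialty` field, and the function name `sample_pilot_set` are hypothetical, and your data may need a separate step to label specialties.

```python
import random
from collections import defaultdict


def sample_pilot_set(records, per_specialty, seed=0):
    """Draw up to `per_specialty` records from each specialty, reproducibly."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    by_specialty = defaultdict(list)
    for record in records:
        by_specialty[record["specialty"]].append(record)

    pilot = []
    for specialty in sorted(by_specialty):  # sorted for a stable iteration order
        group = by_specialty[specialty]
        k = min(per_specialty, len(group))
        pilot.extend(rng.sample(group, k))
    return pilot


# Toy stand-in data: 600 placeholder records across three specialties.
specialties = ["cardiology", "pediatrics", "dermatology"]
records = [
    {"id": i, "specialty": specialties[i % 3], "question": f"placeholder question {i}"}
    for i in range(600)
]

pilot = sample_pilot_set(records, per_specialty=100, seed=42)
print(len(pilot))
```

Fixing the seed keeps the comparison fair across models and lets you rerun the same pilot after changing prompts or judge settings.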

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Zero-shot only: no task-specific fine-tuning was performed, so real-world deployed systems may require extra tuning.

Subset sampling: evaluation used a 3,000-pair random subset from 38k, which may not capture all specialty edge cases.

When Not To Use

Do not use these zero-shot results as proof of clinical safety for live patient-facing systems.

Avoid relying on lexical metrics alone to approve clinical outputs for deployment.

Failure Modes

Hallucinations: models may state incorrect medical facts despite fluent language.

Specialty variance: performance depends on question complexity and medical domain.

Core Entities

Models

Llama-3-8B-Instruct, Llama-3.2-3B, Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-128E-Instruct, GPT-5-mini

Metrics

BLEU-1, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, LLM-as-a-Judge overall and per-dimension scores

Datasets

iCliniq medical QA (38k pairs); 3,000-pair sampled subset