Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

February 16, 2026, 7 min read

Overview

Production Readiness

0.3

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

0

Authors

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury

Links

Abstract / PDF

Why It Matters For Business

This paper shows that larger open models give substantially better zero-shot medical answers, while a parameter-efficient model (Llama-4-Maverick-17B) comes close to the top scores. For product teams, that means you can give up a small amount of answer quality for lower compute cost and still get usable clinical answers, provided you pick the right model and add quality checks.

Summary TLDR

The authors ran a zero-shot benchmark of five LLMs on 3,000 iCliniq medical Q&A pairs. Llama-3.3-70B led on BLEU/ROUGE and on an LLM-as-a-judge clinical score. Llama-4-Maverick-17B delivered near-best quality with far fewer parameters. GPT-5-mini scored poorly on lexical metrics but reasonably on safety. The study argues for combining n-gram metrics with LLM judges to capture clinical quality.

Problem Statement

Medical QA needs reliable, clinically safe models. Existing evaluations rely on n-gram metrics that miss factual accuracy and safety. Practitioners need a clear, practical ranking of recent LLMs on real medical Q&A without task-specific fine-tuning.

Main Contribution

Zero-shot benchmark of five contemporary LLMs (Llama variants and GPT-5-mini) on a 3,000-sample subset of the 38k iCliniq medical QA dataset using BLEU and ROUGE.

A dual evaluation: standard automatic metrics plus an LLM-as-a-judge (Claude Sonnet 4) scoring medical accuracy, safety, completeness, clarity, and helpfulness with a weighted rubric.
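A weighted rubric of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights in `RUBRIC_WEIGHTS` and the function name `weighted_judge_score` are hypothetical, since the summary does not state the actual weights. Only the five dimension names come from the paper.

```python
# Hypothetical weights for illustration only; the paper's actual
# rubric weights are not given in this summary.
RUBRIC_WEIGHTS = {
    "medical_accuracy": 0.30,
    "safety": 0.25,
    "completeness": 0.20,
    "clarity": 0.15,
    "helpfulness": 0.10,
}


def weighted_judge_score(scores, weights=RUBRIC_WEIGHTS):
    """Combine per-dimension 1-5 judge scores into one weighted overall score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalise by the total weight so the result stays on the 1-5 scale
    # even if the weights do not sum exactly to 1.
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical judge output for a single response.
example = {
    "medical_accuracy": 4,
    "safety": 5,
    "completeness": 3,
    "clarity": 4,
    "helpfulness": 4,
}
overall = weighted_judge_score(example)
print(round(overall, 2))
```

Weighting accuracy and safety more heavily than clarity is one plausible design choice for clinical content; teams should set the weights to match their own risk tolerance.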

Empirical scaling and efficiency analysis showing Llama-3.3-70B as top performer and Llama-4-Maverick-17B as a parameter-efficient alternative suitable for constrained deployments.

Key Findings

Llama-3.3-70B achieved the highest automatic and judge scores.

Numbers: BLEU-1 0.2207; ROUGE-1 0.2761; Judge 4.40/5

Llama-4-Maverick-17B matches most of the top model's performance with fewer parameters.

Numbers: ROUGE-1 0.2597 (~94% of Llama-3.3-70B's ROUGE-1); Judge 4.23/5

GPT-5-mini scored poorly on lexical overlap but showed decent safety ratings.

Numbers: BLEU-1 0.0124; ROUGE-L 0.0914; Judge overall 3.16/5; Safety 3.80/5

Automatic n-gram metrics correlate with judge scores but miss clinical dimensions.

Numbers: ROUGE-1 r = 0.89 with judge overall score; BLEU r = 0.67 with medical accuracy

Performance varies a lot by question type and specialty.

Numbers: reported BLEU-1 standard deviations range from ±0.0622 to ±0.0956
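The correlation figures in the findings above are Pearson r values. The sketch below shows how such a value could be computed on your own evaluation data. It assumes paired per-response scores (one lexical score and one judge score per answer); the summary does not say at what granularity the authors computed r, and the numbers in the example are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-response scores, for illustration only.
rouge1_scores = [0.31, 0.22, 0.27, 0.18, 0.25]
judge_scores = [4.5, 3.5, 4.0, 3.0, 4.0]
r = pearson_r(rouge1_scores, judge_scores)
print(f"r = {r:.2f}")
```

A high r means the lexical metric ranks responses similarly to the judge; it does not mean the metric detects unsafe or factually wrong answers.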

Results

BLEU-1

Value: Llama-3.3-70B 0.2207; Llama-4-Maverick-17B 0.2089; Llama-3.2-3B 0.2012; Llama-3-8B 0.1739; GPT-5-mini 0.0124

Baseline: MedLM best 0.0998 (GPT-2)

ROUGE-1

Value: Llama-3.3-70B 0.2761; Llama-4-Maverick-17B 0.2597; Llama-3.2-3B 0.2588; Llama-3-8B 0.2419; GPT-5-mini 0.2024

Baseline: MedLM best 0.0022 (GPT-3.5)

ROUGE-L

Value: Llama-3.3-70B 0.1306; Llama-4-Maverick-17B 0.1260; Llama-3.2-3B 0.1258; Llama-3-8B 0.1219; GPT-5-mini 0.0914

Baseline: MedLM best 0.0019 (GPT-3.5)

LLM-as-a-Judge overall score

Value: Llama-3.3-70B 4.40; Llama-4-Maverick-17B 4.23; Llama-3-8B 3.77; Llama-3.2-3B 3.20; GPT-5-mini 3.16 (out of 5)

High-quality response rate (judge ≥4.0)

Value: Llama-3.3-70B 88%; Llama-4-Maverick-17B 84%; Llama-3-8B 65%; Llama-3.2-3B 25%; GPT-5-mini 23%
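The high-quality response rate is the share of responses whose judge score meets the 4.0 threshold. A minimal sketch of that calculation follows; the function name `high_quality_rate` and the sample scores are hypothetical, and only the ≥4.0 threshold comes from the paper.

```python
def high_quality_rate(judge_scores, threshold=4.0):
    """Fraction of responses whose judge score is at or above the threshold."""
    if not judge_scores:
        raise ValueError("judge_scores must not be empty")
    hits = sum(1 for score in judge_scores if score >= threshold)
    return hits / len(judge_scores)


# Hypothetical overall judge scores for eight responses.
scores = [4.5, 4.0, 3.8, 4.2, 2.9, 4.0, 3.5, 4.6]
rate = high_quality_rate(scores)
print(f"{rate:.1%}")  # 5 of the 8 scores are >= 4.0
```

Reporting this rate alongside the mean score helps separate a model that is consistently good from one with a similar average but more low-scoring answers.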

Who Should Care

What To Try In 7 Days

Run a 200–300 sample zero-shot test on your target clinical topics using Llama-4-Maverick-17B and Llama-3.3-70B to compare cost vs. quality.

Add an LLM-as-a-judge step (use a vetted judge model) to your evaluation pipeline to catch factual and safety gaps not visible to BLEU/ROUGE.

Conduct spot checks in high-risk specialties (cardiology, pediatrics) to measure variability and decide where human review is mandatory.

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Zero-shot only: no task-specific fine-tuning was performed, so real-world deployed systems may require extra tuning.
  • Subset sampling: evaluation used a 3,000-pair random subset from 38k, which may not capture all specialty edge cases.
  • Judge bias and single-judge model: clinical judgments rely on a single judge model, Claude Sonnet 4, so that model's biases can skew the scores.
  • Automatic metrics' blind spots: BLEU/ROUGE measure lexical overlap and miss factual correctness and safety without the judge step.
  • Unclear GPT-5-mini setup: the authors note possible evaluation or configuration issues that may have depressed GPT-5-mini's scores.
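The subset-sampling limitation above can be partly checked in your own replication by drawing the subset with a fixed seed and comparing per-specialty counts against the full dataset. The sketch below assumes each record carries a `specialty` field, which is a hypothetical schema and not something the summary confirms for iCliniq; the seed value is also arbitrary.

```python
import random
from collections import Counter


def sample_subset(records, n, seed=42):
    """Draw a reproducible random subset of n records without replacement."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return rng.sample(records, n)


def coverage(records, subset, key="specialty"):
    """Map each group to (count in full dataset, count in subset)."""
    full_counts = Counter(record[key] for record in records)
    subset_counts = Counter(record[key] for record in subset)
    return {group: (full_counts[group], subset_counts.get(group, 0))
            for group in full_counts}


# Hypothetical dataset: 300 records spread evenly over three specialties.
specialties = ["cardiology", "pediatrics", "dermatology"]
records = [{"id": i, "specialty": specialties[i % 3]} for i in range(300)]

subset = sample_subset(records, 30)
for group, (full, sampled) in sorted(coverage(records, subset).items()):
    print(f"{group}: {sampled} of {full} sampled")
```

Specialties with zero or very few sampled records are the ones where a random subset says little, and where targeted extra sampling or human review is worth considering.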

When Not To Use

  • Do not use these zero-shot results as proof of clinical safety for live patient-facing systems.
  • Avoid relying on lexical metrics alone to approve clinical outputs for deployment.
  • Do not assume consistent performance across all medical specialties without targeted validation.

Failure Modes

  • Hallucinations: models may state incorrect medical facts despite fluent language.
  • Specialty variance: performance depends on question complexity and medical domain.
  • Judge-model bias: LLM-as-a-judge can mis-evaluate factual accuracy or safety.
  • Underreporting of calibration issues: small models might appear safe but omit critical clinical content.

Core Entities

Models

  • Llama-3-8B-Instruct
  • Llama-3.2-3B
  • Llama-3.3-70B-Instruct
  • Llama-4-Maverick-17B-128E-Instruct
  • GPT-5-mini

Metrics

  • BLEU-1
  • BLEU-4
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • LLM-as-a-Judge overall and per-dimension scores

Datasets

  • iCliniq medical QA (38k pairs; 3,000-pair random subset evaluated)