Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Overview

Decision Snapshot: Needs Validation

The evaluation uses clear metrics and a judge model but is zero-shot only, uses a sampled subset, and relies on one judge model. Results are indicative for benchmarking but not sufficient alone for clinical deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 30%

Authors

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury

Why It Matters For Business

This paper shows that the largest open model tested (Llama-3.3-70B) gives the best zero-shot medical answers, while a parameter-efficient model (Llama-4-Maverick-17B) comes close to it. For product teams, that means you can trade some accuracy for lower compute cost and still get usable clinical answers, provided you pick the right model and add quality checks.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors ran a zero-shot benchmark of five LLMs on 3,000 iCliniq medical Q&A pairs. Llama-3.3-70B led on BLEU/ROUGE and on an LLM-as-a-judge clinical score. Llama-4-Maverick-17B delivered near-best quality with far fewer parameters. GPT-5-mini performed poorly on lexical metrics (BLEU-1 0.0124) but scored reasonably on safety. The study argues for combining n-gram metrics with LLM judges to capture clinical quality.

Problem Statement

Medical QA needs reliable, clinically safe models. Existing evaluations rely on n-gram metrics that miss factual and safety aspects. Practitioners need a clearer, practical ranking of recent LLMs in real medical Q&A without task-specific fine-tuning.

Main Contribution

Zero-shot benchmark of five contemporary LLMs (Llama variants and GPT-5-mini) on a 3,000-sample subset of the 38k iCliniq medical QA dataset using BLEU and ROUGE.

A dual evaluation: standard automatic metrics plus an LLM-as-a-judge (Claude Sonnet 4) scoring medical accuracy, safety, completeness, clarity, and helpfulness with a weighted rubric.
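The weighted rubric can be illustrated with a minimal sketch. The summary names the five dimensions but not their weights, so the weights below are hypothetical placeholders, as is the assumption that each dimension is scored from 1 to 5. The function name `weighted_judge_score` is also illustrative and not taken from the paper.

```python
from typing import Dict

# Hypothetical weights: the source lists the five dimensions but does not
# state how they are weighted. Replace with the paper's rubric if known.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "medical_accuracy": 0.30,
    "safety": 0.25,
    "completeness": 0.15,
    "clarity": 0.15,
    "helpfulness": 0.15,
}


def weighted_judge_score(
    scores: Dict[str, float],
    weights: Dict[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Combine per-dimension judge scores into one weighted overall score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in weights:
        # Assumed 1-5 scale, consistent with the "x/5" judge scores reported.
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} score {scores[name]} is outside 1-5")
    total_weight = sum(weights.values())
    # Normalise by the weight total so the result stays on the 1-5 scale.
    return sum(weights[d] * scores[d] for d in weights) / total_weight


# Illustrative per-dimension scores for a single answer (not real data).
example_scores = {
    "medical_accuracy": 5,
    "safety": 4,
    "completeness": 4,
    "clarity": 5,
    "helpfulness": 4,
}
overall = weighted_judge_score(example_scores)
```

Normalising by the weight total means the weights do not have to sum to exactly 1, which makes it easier to experiment with different rubrics.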

Key Findings

Llama-3.3-70B achieved the highest automatic and judge scores.

Numbers: BLEU-1 0.2207; ROUGE-1 0.2761; Judge 4.40/5

Practical Use: If you need the strongest zero-shot medical QA output and have compute, prioritize large models like Llama-3.3-70B.

Evidence Ref: Tables II and IV

Llama-4-Maverick-17B matches most of the top model's performance with fewer parameters.

Numbers: ROUGE-1 0.2597 (~94% of Llama-3.3 ROUGE-1); Judge 4.23/5

Practical Use: For resource-limited deployments, Llama-4-Maverick-17B offers near-top medical QA quality at lower compute cost.

Evidence Ref: Table II; section IV.C

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU-1	Llama-3.3-70B 0.2207; Llama-4-Maverick-17B 0.2089; Llama-3.2-3B 0.2012; Llama-3-8B 0.1739; GPT-5-mini 0.0124	MedLM best 0.0998 (GPT-2)	Llama-3.3-70B ≈ +0.121 over best baseline	iCliniq 3,000-sample subset	Table II	Table II
ROUGE-1	Llama-3.3-70B 0.2761; Llama-4-Maverick-17B 0.2597; Llama-3.2-3B 0.2588; Llama-3-8B 0.2419; GPT-5-mini 0.2024	MedLM best 0.0022 (GPT-3.5)	Llama-3.3-70B ≈ +0.2739 over best baseline	iCliniq 3,000-sample subset	Table II and III	Tables II/III
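The Delta column and the "~94%" figure quoted under Key Findings are simple arithmetic on the reported scores. The short sketch below reproduces them from the values in the table. The helper names are illustrative, and the numbers are copied from this summary rather than recomputed from model outputs.

```python
# Scores copied from the Results table in this summary.
bleu1 = {
    "Llama-3.3-70B": 0.2207,
    "Llama-4-Maverick-17B": 0.2089,
    "Llama-3.2-3B": 0.2012,
    "Llama-3-8B": 0.1739,
    "GPT-5-mini": 0.0124,
}
rouge1 = {
    "Llama-3.3-70B": 0.2761,
    "Llama-4-Maverick-17B": 0.2597,
    "Llama-3.2-3B": 0.2588,
    "Llama-3-8B": 0.2419,
    "GPT-5-mini": 0.2024,
}
BLEU1_BASELINE = 0.0998   # MedLM best (GPT-2), as reported in the table
ROUGE1_BASELINE = 0.0022  # MedLM best (GPT-3.5), as reported in the table


def delta_over_baseline(value: float, baseline: float) -> float:
    """Absolute improvement of a score over the baseline."""
    return round(value - baseline, 4)


def share_of_best(value: float, best: float) -> float:
    """Fraction of the best model's score that another model reaches."""
    return round(value / best, 3)


bleu1_delta = delta_over_baseline(bleu1["Llama-3.3-70B"], BLEU1_BASELINE)
rouge1_delta = delta_over_baseline(rouge1["Llama-3.3-70B"], ROUGE1_BASELINE)
maverick_share = share_of_best(
    rouge1["Llama-4-Maverick-17B"], rouge1["Llama-3.3-70B"]
)
```

Here `bleu1_delta` and `rouge1_delta` correspond to the two Delta entries, and `maverick_share` corresponds to the roughly 94% ratio for Llama-4-Maverick-17B.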

What To Try In 7 Days

Run a 200–300 sample zero-shot test on your target clinical topics using Llama-4-Maverick-17B and Llama-3.3-70B to compare cost vs. quality.

Add an LLM-as-a-judge step (use a vetted judge model) to your evaluation pipeline to catch factual and safety gaps not visible to BLEU/ROUGE.

Conduct spot checks in high-risk specialties (cardiology, pediatrics) to measure variability and decide where human review is mandatory.
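For the pilot comparison, a simplified ROUGE-1 can be computed with the standard library alone. The sketch below assumes lowercase whitespace tokenization with no stemming, so its scores will not match an official ROUGE implementation exactly. The function names and the sample answer pair are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def mean_rouge1_f1(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average ROUGE-1 F1 over (model_answer, reference_answer) pairs."""
    scores = [rouge1(cand, ref)["f1"] for cand, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative pair, not taken from the iCliniq dataset.
example = rouge1(
    "take rest and drink fluids",
    "rest and drink plenty of fluids",
)
```

Running `mean_rouge1_f1` once per model on the same 200 to 300 question set gives a like-for-like lexical comparison, to be read alongside judge scores rather than instead of them.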

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.kaggle.com/datasets/henry41148/icliniq-medical-qa-38k

Risks & Boundaries

Limitations

Zero-shot only: no task-specific fine-tuning was performed, so real-world deployed systems may require extra tuning.

Subset sampling: evaluation used a 3,000-pair random subset from 38k, which may not capture all specialty edge cases.
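A reproducible random subset of this kind can be drawn with a fixed seed, as in the minimal sketch below. The seed value, the function name, and the stand-in record list are assumptions; the summary does not say how the authors drew their sample.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(records: Sequence[T], k: int = 3000, seed: int = 42) -> List[T]:
    """Draw k records without replacement, reproducibly for a given seed."""
    if k > len(records):
        raise ValueError(f"cannot sample {k} items from {len(records)} records")
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(records, k)


# Stand-in for the roughly 38k Q&A pairs: integer IDs instead of real records.
all_records = list(range(38000))
subset = sample_subset(all_records)
```

Because a plain random sample can under-represent rare specialties, a stratified draw per specialty may be worth considering if the dataset has a specialty field.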

When Not To Use

Do not use these zero-shot results as proof of clinical safety for live patient-facing systems.

Avoid relying on lexical metrics alone to approve clinical outputs for deployment.

Failure Modes

Hallucinations: models may state incorrect medical facts despite fluent language.

Specialty variance: performance depends on question complexity and medical domain.
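Specialty variance can be made visible by grouping judge scores per specialty. The sketch below assumes each evaluated answer carries a specialty label and an overall judge score. The review threshold and the sample rows are hypothetical and not drawn from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def summarize_by_specialty(
    rows: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Group (specialty, judge_score) rows and report count, mean and spread."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for specialty, score in rows:
        grouped[specialty].append(score)
    return {
        name: {"n": len(vals), "mean": mean(vals), "stdev": pstdev(vals)}
        for name, vals in grouped.items()
    }


def flag_for_review(
    summary: Dict[str, Dict[str, float]], min_mean: float = 4.0
) -> List[str]:
    """List specialties whose mean score is below a hypothetical threshold."""
    return sorted(name for name, s in summary.items() if s["mean"] < min_mean)


# Hypothetical rows for illustration only.
rows = [
    ("cardiology", 3.5),
    ("cardiology", 4.5),
    ("pediatrics", 4.5),
    ("pediatrics", 4.5),
]
summary = summarize_by_specialty(rows)
needs_review = flag_for_review(summary, min_mean=4.2)
```

A high standard deviation within a specialty is as much a signal as a low mean, since it suggests answers are unreliable from question to question.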

Core Entities

Models

Llama-3-8B-Instruct, Llama-3.2-3B, Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-128E-Instruct, GPT-5-mini

Metrics

BLEU-1, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, LLM-as-a-Judge overall and per-dimension scores

Datasets

iCliniq medical QA (38k); 3,000-pair sampled subset
