GPT-4 can act as an automatic clinical judge of ophthalmology chatbot answers; fine-tuning helps but can also harm generalization

February 15, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.45

Citation Count

6

Authors

Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting

Why It Matters For Business

Automated LLM evaluation with GPT-4 can scale clinical-quality review and spot dangerous hallucinations, cutting manual grading costs and speeding iteration for healthcare chatbots; validate automated scores against clinicians before releasing patient-facing features.

Summary TLDR

The authors fine-tuned five LLM variants (GPT-3.5 and four LLaMA2 variants) on 368 clinician-authored ophthalmology QnA pairs and tested 200 generated answers on a 40-question holdout plus 8 glaucoma questions. They used GPT-4 as an automated clinical evaluator (custom rubric) and compared its rankings to those of five clinician graders. GPT-4's rankings correlated strongly with the clinicians' (Spearman ≈0.90, Kendall τ ≈0.80), and GPT-4 identified clear factual errors and safety risks in LLM outputs. Key practical findings: GPT-3.5 received the highest GPT-4 score on the test set (87.1/100), but fine-tuning sometimes reduced performance on unseen glaucoma items (native GPT-3.5 94.5 vs fine-tuned 73.1). Small, in

Problem Statement

Human grading of medical chatbot answers is slow and costly. The paper asks two questions: whether GPT-4, guided by a clinician-designed rubric, can reproduce clinicians' rankings of LLM answers to common ophthalmology questions, and whether fine-tuning LLMs on a small domain dataset reliably improves safety and accuracy.

Main Contribution

Created a clinician-written ophthalmology dataset: 400 QnA pairs (368 fine-tune / 40 test + 8 glaucoma holdouts).

Fine-tuned GPT-3.5 and four LLaMA2 variants using the 368-pair domain set and standardized training config (LoRA, 3 epochs).

Built a GPT-4 evaluation pipeline with a clinician-designed rubric (clinical accuracy, relevance, patient safety, readability) and compared GPT-4 rankings to 5 human clinicians.

Measured statistical agreement (Spearman, Kendall Tau, Cohen's Kappa) and conducted qualitative error analyses including glaucoma sub-analysis.
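The three agreement statistics named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' analysis code: the rank lists are hypothetical toy data, and Kendall's tau is computed in its simplest (tau-a) form.

```python
from itertools import combinations


def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    score = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        pairs += 1
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / pairs


def cohen_kappa(a, b):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(a)
    observed = sum(p == q for p, q in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    return (observed - expected) / (1 - expected)


# Hypothetical ranks (1 = best) for five answers; NOT data from the paper.
judge_ranks = [1, 2, 3, 4, 5]
clinician_ranks = [1, 3, 2, 4, 5]

rho = spearman(judge_ranks, clinician_ranks)       # approximately 0.9
tau = kendall_tau(judge_ranks, clinician_ranks)    # 0.8
kappa = cohen_kappa(judge_ranks, clinician_ranks)  # approximately 0.5
```

In this toy case the two raters differ by a single adjacent swap among five ranked items, which is enough to pull kappa (exact-match agreement) down to about 0.5 while the rank correlations stay high. It shows why the three statistics can diverge on small rankings.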

Key Findings

GPT-4 automated rankings strongly matched clinician rankings on the test set.

Numbers: Spearman ≈0.90; Kendall Tau ≈0.80; Cohen's Kappa ≈0.50

GPT-3.5 responses received the highest average GPT-4 score on the test set.

Numbers: GPT-3.5 = 87.1; LLAMA2-13b = 80.9; LLAMA2-13b-chat = 75.5

Fine-tuning sometimes degraded performance on unseen glaucoma questions.

Numbers: Native GPT-3.5 = 94.5 vs fine-tuned GPT-3.5 = 73.1 (GPT-4 scores on glaucoma subset)

GPT-4 correctly flagged a dangerous hallucination in a LLaMA2 response (non-existent 'foetal cataract surgery').

Results

GPT-4 evaluation score (test set)

Value: GPT-3.5 87.1; LLAMA2-13b 80.9; LLAMA2-13b-chat 75.5; LLAMA2-7b-chat 70.0; LLAMA2-7b 68.8

Agreement between GPT-4 rankings and clinicians

Value: Spearman ≈0.90; Kendall Tau ≈0.80; Cohen's Kappa ≈0.50 (combined grading)

Glaucoma sub-analysis GPT-4 scores

Value: LLAMA2-13b 90; LLAMA2-13b-chat 76.2; GPT-3.5 73.1; LLAMA2-7b-chat 69.4; LLAMA2-7b 53.1

Fine-tuning time (training cost proxy)

Value: LLAMA2-7b 6:25; LLAMA2-7b-Chat 6:30; LLAMA2-13b 11:58; LLAMA2-13b-Chat 12:12; GPT-3.5 38:32 (min:sec)
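One way to read the test-set and glaucoma scores together is as a per-model generalization gap (glaucoma score minus test-set score). The sketch below uses only the figures reported in this section and assumes that both score lists refer to the same fine-tuned variants; the function name is illustrative.

```python
# GPT-4 scores copied from the Results above (0-100 scale).
test_scores = {
    "GPT-3.5": 87.1,
    "LLAMA2-13b": 80.9,
    "LLAMA2-13b-chat": 75.5,
    "LLAMA2-7b-chat": 70.0,
    "LLAMA2-7b": 68.8,
}
glaucoma_scores = {
    "LLAMA2-13b": 90.0,
    "LLAMA2-13b-chat": 76.2,
    "GPT-3.5": 73.1,
    "LLAMA2-7b-chat": 69.4,
    "LLAMA2-7b": 53.1,
}


def generalization_gaps(in_domain, held_out):
    """Held-out score minus in-domain score per model; negative means a drop."""
    return {model: round(held_out[model] - in_domain[model], 1) for model in in_domain}


gaps = generalization_gaps(test_scores, glaucoma_scores)
largest_drop = min(gaps, key=gaps.get)

# Drop attributed to fine-tuning for GPT-3.5 on the glaucoma subset
# (native 94.5 vs fine-tuned 73.1, as reported in Key Findings).
gpt35_finetune_drop = round(94.5 - 73.1, 1)

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:16s} {gap:+.1f}")
```

On these numbers the gap ranges from +9.1 (LLAMA2-13b) to -15.7 (LLAMA2-7b), with GPT-3.5 at -14.0. With only 8 glaucoma questions, differences of this size should be read as indicative rather than conclusive.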

Who Should Care

What To Try In 7 Days

Run GPT-4 (or similar) as a first-pass judge using a short clinician rubric for new medical-answer candidates.

Keep an untouched baseline model (native) alongside fine-tuned variants and compare on held-out subspecialty questions.

Add a human spot-check loop for any responses flagged by GPT-4 as risky or low-scoring.
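The first-pass judge and the human spot-check loop could be wired together roughly as follows. This is a sketch under stated assumptions, not the paper's pipeline: the four criteria come from the rubric described above, but the equal 25-point weighting, the prompt wording, the `safety_concern` field, the review threshold of 70, and the `call_model` interface are all hypothetical. A stub replaces the real GPT-4 call so the example runs offline.

```python
import json

# Rubric criteria named in the paper summary; equal 25-point weights are an assumption.
RUBRIC = ("clinical_accuracy", "relevance", "patient_safety", "readability")
REVIEW_THRESHOLD = 70  # hypothetical cut-off for routing to a clinician


def build_judge_prompt(question, answer):
    """Assemble a grading prompt (illustrative wording, not the authors' prompt)."""
    criteria = ", ".join(RUBRIC)
    return (
        "You are grading a chatbot answer to a patient's ophthalmology question.\n"
        f"Score each criterion from 0 to 25: {criteria}.\n"
        "Also set safety_concern to true if the answer could cause harm.\n"
        "Reply with a single JSON object and nothing else.\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )


def grade(question, answer, call_model):
    """Score one answer; call_model is any function mapping a prompt to a JSON string."""
    reply = call_model(build_judge_prompt(question, answer))
    scores = json.loads(reply)
    total = sum(scores[criterion] for criterion in RUBRIC)
    needs_review = bool(scores.get("safety_concern")) or total < REVIEW_THRESHOLD
    return {"total": total, "needs_human_review": needs_review}


def fake_judge(prompt):
    """Stand-in for a real GPT-4 call so the sketch runs without network access."""
    return json.dumps(
        {
            "clinical_accuracy": 5,
            "relevance": 20,
            "patient_safety": 0,
            "readability": 22,
            "safety_concern": True,
        }
    )


# Invented question/answer wording, loosely modelled on the hallucination noted in the paper.
result = grade(
    "Can cataracts be treated before birth?",
    "Yes, foetal cataract surgery is routine.",
    fake_judge,
)
print(result)  # {'total': 47, 'needs_human_review': True}
```

Passing the model call in as a function keeps the grading logic testable and makes it easy to swap judges, which matters if judge bias toward a model family is a concern.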

Optimization Features

Token Efficiency

  • Prompt and answer length capped at 256 tokens

Infra Optimization

  • Single NVIDIA RTX 4090 GPU; H2O.ai LLM Studio defaults used for LLaMA2

Model Optimization

  • LoRA
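As a rough illustration of why LoRA keeps fine-tuning cheap: instead of updating a full d_out × d_in weight matrix, it trains two low-rank factors with rank × (d_out + d_in) parameters in total. The matrix size and rank below are assumed values chosen for illustration; the summary does not state the rank or target layers used in the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed values for illustration only (not reported in the summary).
d_out = d_in = 4096  # one square projection matrix
rank = 8

full_params = d_out * d_in
lora_params = lora_trainable_params(d_out, d_in, rank)
fraction = lora_params / full_params

print(f"full: {full_params:,}  LoRA: {lora_params:,}  fraction: {fraction:.4%}")
```

Under these assumptions the adapter trains 65,536 parameters for this one matrix instead of 16,777,216, about 0.39% of the full count.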

System Optimization

  • Consistent system prompt across models to standardize behavior

Training Optimization

  • Gradient checkpointing, mixed precision

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small, clinician-crafted dataset (400 QnA total) limits generalizability.
  • Fine-tuning set omitted glaucoma deliberately; subtopic generalization varied widely.
  • Human grader variability: one clinician showed poor agreement with others.
  • No public release of code or dataset in the paper to reproduce exact experiments.

When Not To Use

  • Do not deploy evaluated models as autonomous patient-facing chatbots without clinician oversight.
  • Avoid trusting a single GPT-4 evaluation for release decisions on high-risk content without human review.
  • Do not assume fine-tuning always improves performance across unseen subspecialties.

Failure Modes

  • Confident hallucinations of dangerous medical procedures (example: non-existent 'foetal cataract surgery').
  • Fine-tuning-induced forgetting where domain-tuning degrades broader medical knowledge.
  • Inter-grader variance causing noisy human labels for calibration.
  • Automated judge bias toward certain model outputs (possibly a preference for GPT-family models).

Core Entities

Models

  • GPT-3.5
  • GPT-4 (evaluator)
  • LLAMA2-7b
  • LLAMA2-7b-Chat
  • LLAMA2-13b
  • LLAMA2-13b-Chat

Metrics

  • GPT-4 score (0-100)
  • Spearman correlation
  • Kendall Tau
  • Cohen's Kappa

Datasets

  • Clinician-crafted ophthalmology QnA (400 total; 368 fine-tune / 40 test + 8 glaucoma holdouts)