GPT-4 is promising for dementia screening but does not yet beat the best traditional models

June 2, 2023 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

9

Authors

Zhuo Wang, Rongzhen Li, Bowen Dong, Jie Wang, Xiuxing Li, Ning Liu, Chenhui Mao, Wei Zhang, Liling Dong, Jing Gao, Jianyong Wang

Links

Abstract / PDF

Why It Matters For Business

LLMs like GPT-4 can speed up prototyping (no training set needed) and give readable explanations, but they do not yet match tuned clinical models; deploy cautiously and consider hybrid systems that pair an LLM with a supervised model.

Summary TLDR

The authors convert dementia diagnosis into multiple-choice prompts and test GPT-4 (and GPT-3.5) against five supervised models on two clinical datasets (public ADNI and private PUMCH). GPT-4 beats older GPT-3.5 and some simple models, but it does not outperform the best supervised rule-based model (RRL). Key issues are sensitivity to input format, weak tabular handling, inability to fine-tune, and possible dataset leakage. Numbers: ADNI accuracy GPT-4 0.820 vs RRL 0.852; PUMCH-B GPT-4 0.737 vs RRL 0.789; PUMCH-T GPT-4 few-shot 0.632 vs RRL 0.763.

Problem Statement

Can an off-the-shelf LLM (GPT-4) replace or outperform traditional supervised AI methods for clinical dementia diagnosis, without fine-tuning, and how interpretable and faithful are its explanations?

Main Contribution

Design simple multiple-choice prompt templates that convert each patient record into a question for GPT-4.

Evaluate GPT-4 and GPT-3.5 versus five supervised models on two real clinical datasets (ADNI public, PUMCH private).

Qualitatively compare GPT-4 explanations to doctors and list current limitations and future directions.
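The prompt-template idea above can be sketched as follows. This is a minimal illustration, not the authors' template: the option wording, the question text, the feature names in the usage note, and the helpers `describe_record` and `build_prompt` are all assumptions made for the example.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical answer options; the paper's exact wording is not reproduced here.
OPTIONS = ("Cognitively normal", "Mild cognitive impairment", "Dementia")
QUESTION = "Which diagnosis best fits this patient?"  # assumed phrasing


def describe_record(record: Mapping[str, object]) -> str:
    """Render each feature as one '- name: value' line."""
    return "\n".join(f"- {name}: {value}" for name, value in record.items())


def build_prompt(
    record: Mapping[str, object],
    examples: Sequence[Tuple[Mapping[str, object], str]] = (),
) -> str:
    """Build a multiple-choice prompt for one patient record.

    `examples` holds (record, answer_letter) pairs. Leaving it empty gives a
    zero-shot prompt; passing a few solved cases gives a few-shot prompt.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    option_lines = "\n".join(
        f"{letters[i]}. {option}" for i, option in enumerate(OPTIONS)
    )

    def block(rec: Mapping[str, object], answer: str) -> str:
        # `answer` is empty for the record the model is asked to diagnose.
        return (
            f"Patient record:\n{describe_record(rec)}\n"
            f"Question: {QUESTION}\n{option_lines}\n"
            f"Answer:{' ' + answer if answer else ''}"
        )

    parts = [block(ex_record, ex_answer) for ex_record, ex_answer in examples]
    parts.append(block(record, ""))
    return "\n\n".join(parts)
```

Calling `build_prompt({"age": 72, "MMSE total": 24})` yields a zero-shot question, and adding `examples=[({"age": 80, "MMSE total": 15}, "C")]` prepends one solved case. Keeping feature names and units identical across records matters here, since the summary reports that GPT-4 is sensitive to input format.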

Key Findings

GPT-4 does not beat the best supervised model (RRL) on evaluated datasets.

Numbers: ADNI: GPT-4 0.820 vs RRL 0.852; PUMCH-T few-shot: GPT-4 0.632 vs RRL 0.763

GPT-4 substantially outperforms GPT-3.5 on these tasks.

Numbers: ADNI: GPT-4 0.820 vs GPT-3.5 few-shot 0.639

Few-shot prompting can improve performance over zero-shot in some cases.

Numbers: PUMCH-T: GPT-4 0-shot 0.553 -> few-shot 0.632

GPT-4's outputs and explanations are sensitive to input quality and prompt format.

Numbers: Qualitative case studies show misdiagnosis when test totals or feature descriptions are inconsistent

Results

Accuracy

Value: ADNI: GPT-4 0.820 (0-shot), 0.820 (few-shot)

Baseline: RRL 0.852

Accuracy

Value: PUMCH-B: GPT-4 0.737

Baseline: RRL 0.789

Accuracy

Value: PUMCH-T: GPT-4 0.553 (0-shot), 0.632 (few-shot)

Baseline: RRL 0.763

Accuracy

Value: ADNI: GPT-3.5 few-shot 0.639

Baseline: GPT-4 0.820
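To make the size of the gaps explicit, the reported accuracies can be turned into percentage-point differences. The short sketch below uses only the figures quoted in this summary; the dictionary layout and the helper `gap_in_points` are illustrative choices.

```python
# Accuracies as reported in this summary (GPT-4 vs the RRL baseline).
results = {
    "ADNI": {"GPT-4": 0.820, "RRL": 0.852},
    "PUMCH-B": {"GPT-4": 0.737, "RRL": 0.789},
    "PUMCH-T (few-shot)": {"GPT-4": 0.632, "RRL": 0.763},
}


def gap_in_points(scores: dict) -> float:
    """Return RRL accuracy minus GPT-4 accuracy, in percentage points."""
    return round((scores["RRL"] - scores["GPT-4"]) * 100, 1)


gaps = {dataset: gap_in_points(scores) for dataset, scores in results.items()}

for dataset, gap in gaps.items():
    print(f"{dataset}: RRL leads GPT-4 by {gap} points")
```

The gap is about 3.2 points on ADNI, 5.2 on PUMCH-B, and 13.1 on PUMCH-T, so GPT-4 trails most on the private PUMCH-T task.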

Who Should Care

What To Try In 7 Days

Run GPT-4 prompts on a held-out sample and compare accuracy to your current model.

Build standardized input templates and automated sanity checks for numeric test scores.

Add a few solved examples to the prompt and check whether few-shot prompting improves accuracy for your use case.
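The first two steps above could start from something like the sketch below. It is an assumption-laden illustration: the helper names (`check_range`, `check_total`, `accuracy`), the subscore values, and the sample labels are hypothetical, and the 0 to 30 range is the standard MMSE total range, used here only as an example of a bounded test score.

```python
from typing import List, Optional, Sequence


def check_range(name: str, value: Optional[float], low: float, high: float) -> List[str]:
    """Flag a score that is missing or outside its valid range."""
    if value is None:
        return [f"{name}: missing"]
    if not low <= value <= high:
        return [f"{name}: {value} outside [{low}, {high}]"]
    return []


def check_total(name: str, total: float, subscores: Sequence[float]) -> List[str]:
    """Flag a reported total that does not equal the sum of its subscores."""
    expected = sum(subscores)
    if total != expected:
        return [f"{name}: total {total} != sum of subscores {expected}"]
    return []


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of held-out cases where the prediction matches the label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical record: an out-of-range total and a total that disagrees with its parts.
problems = check_range("MMSE total", 34, 0, 30) + check_total("MMSE", 24, [10, 8, 5])

# Hypothetical held-out labels and predictions from two systems.
labels = ["A", "C", "B", "C"]
llm_accuracy = accuracy(["A", "C", "C", "C"], labels)
baseline_accuracy = accuracy(["A", "C", "B", "C"], labels)
```

Records with a non-empty `problems` list can be held back for manual review before prompting, since inconsistent totals are one of the input issues tied to misdiagnosis in the case studies. Comparing `llm_accuracy` with `baseline_accuracy` on the same held-out cases keeps the comparison like for like.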

Reproducibility

Data Urls

  • ADNI is public (adni.loni.usc.edu); PUMCH is private

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only two datasets tested (one private, one public), limiting generality.
  • GPT-4 could not be fine-tuned in this study, so it cannot learn dataset-specific thresholds.
  • LLM is sensitive to prompt wording and input quality; tabular handling is weak.
  • Explanations may list reasons without faithfully reflecting how much each one contributed to the decision.

When Not To Use

  • As a drop-in replacement for validated clinical diagnostic models in production.
  • On raw tabular patient data without input standardization and error checks.
  • For high-stakes automated diagnosis without human oversight.

Failure Modes

  • False positives from over-interpreting slightly low test scores.
  • Missed or misapplied cutoffs, because GPT-4 may not apply numeric thresholds correctly.
  • Inconsistent explanations that do not match the final decision.
  • Performance inflation on public datasets due to potential training-data leakage.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • RRL
  • Logistic Regression
  • CART (Decision Tree)
  • Random Forest
  • XGBoost

Metrics

  • Accuracy

Datasets

  • ADNI
  • PUMCH (private)