GPT-4 is promising for dementia screening but does not yet beat the best traditional models

June 2, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper reports clear head-to-head accuracy numbers on one public and one private dataset, plus case studies, but its scope is limited to those two datasets and no code was shared.

Citations: 9

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 50%

Authors

Zhuo Wang, Rongzhen Li, Bowen Dong, Jie Wang, Xiuxing Li, Ning Liu, Chenhui Mao, Wei Zhang, Liling Dong, Jing Gao, Jianyong Wang

Why It Matters For Business

LLMs like GPT-4 can speed up prototyping (no training set needed) and give readable explanations, but they do not yet match tuned clinical models. Deploy cautiously and consider hybrids that pair an LLM with a supervised model.

Summary TLDR

The authors convert dementia diagnosis into multiple-choice prompts and test GPT-4 (and GPT-3.5) against five supervised models on two clinical datasets (public ADNI and private PUMCH). GPT-4 beats the older GPT-3.5 and some simple models, but it does not outperform the best supervised model, the rule-based RRL. The key issues are sensitivity to input format, weak handling of tabular data, the inability to fine-tune, and possible dataset leakage. Reported accuracies: ADNI, GPT-4 0.820 vs RRL 0.852; PUMCH-B, GPT-4 0.737 vs RRL 0.789; PUMCH-T, GPT-4 few-shot 0.632 vs RRL 0.763.

Problem Statement

Can an off-the-shelf LLM (GPT-4) replace or outperform traditional supervised AI methods for clinical dementia diagnosis, without fine-tuning, and how interpretable and faithful are its explanations?

Main Contribution

Design simple multiple-choice prompt templates that convert each patient record into a question for GPT-4.

Evaluate GPT-4 and GPT-3.5 versus five supervised models on two real clinical datasets (ADNI public, PUMCH private).
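
A minimal sketch of what such a prompt template could look like, assuming each patient record arrives as a dictionary of feature names and values. The feature names, option labels, and wording are hypothetical illustrations and are not the authors' actual template.

```python
from typing import Mapping, Sequence


def build_prompt(record: Mapping[str, object], options: Sequence[str]) -> str:
    """Render one patient record as a multiple-choice question.

    The layout and wording are illustrative assumptions, not the paper's prompt.
    """
    # One "name: value" line per feature, in the order supplied.
    feature_lines = [f"- {name}: {value}" for name, value in record.items()]
    # Label the candidate diagnoses A, B, C, ...
    option_lines = [
        f"{chr(ord('A') + i)}. {label}" for i, label in enumerate(options)
    ]
    return "\n".join(
        ["Patient information:"]
        + feature_lines
        + ["", "Question: Which diagnosis best fits this patient?"]
        + option_lines
        + ["", "Answer with a single letter."]
    )


# Hypothetical record; real feature names depend on the dataset.
example = {"Age": 74, "MMSE score": 22, "Education (years)": 12}
prompt = build_prompt(example, ["Non-Dementia", "Dementia"])
print(prompt)
```

Keeping the template in one function makes it easier to hold the input format fixed across runs, which matters given the reported sensitivity to input format.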

Key Findings

GPT-4 does not beat the best supervised model (RRL) on evaluated datasets.

Numbers: ADNI: GPT-4 0.820 vs RRL 0.852; PUMCH-T few-shot: GPT-4 0.632 vs RRL 0.763

Practical Use: Do not replace tuned clinical models with GPT-4 today; keep supervised models like RRL for best accuracy.

Evidence Ref: Table 2

GPT-4 substantially outperforms GPT-3.5 on these tasks.

Numbers: ADNI: GPT-4 0.820 vs GPT-3.5 few-shot 0.639

Practical Use: If using an LLM, prefer GPT-4 over GPT-3.5 for clinical feature-based prediction.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | ADNI: GPT-4 0.820 (0-shot; few-shot also 0.820) | RRL 0.852 | −0.032 vs RRL | ADNI | Table 2 reports model accuracies | Table 2 |
| Accuracy | PUMCH-B: GPT-4 0.737 | RRL 0.789 | −0.052 vs RRL | PUMCH-B (binary Non-Dementia vs Dementia) | Table 2 reports model accuracies | Table 2 |
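
The deltas are simply GPT-4 accuracy minus RRL accuracy. As a quick check, the sketch below recomputes them from the accuracies quoted in this summary. The PUMCH-T few-shot pair comes from the summary text rather than the table, and the helper name `delta_vs_baseline` is an illustrative choice.

```python
# Accuracies as quoted in this summary (attributed to Table 2 of the paper).
reported = {
    "ADNI": {"GPT-4": 0.820, "RRL": 0.852},
    "PUMCH-B": {"GPT-4": 0.737, "RRL": 0.789},
    "PUMCH-T (few-shot)": {"GPT-4": 0.632, "RRL": 0.763},
}


def delta_vs_baseline(scores, model="GPT-4", baseline="RRL"):
    """Return model accuracy minus baseline accuracy, rounded to 3 decimals."""
    return round(scores[model] - scores[baseline], 3)


deltas = {name: delta_vs_baseline(scores) for name, scores in reported.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f} vs RRL")
```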

What To Try In 7 Days

Run GPT-4 prompts on a held-out sample and compare accuracy to your current model.

Build standardized input templates and automated sanity checks for numeric test scores.

Test few-shot examples to see if short prompting improves your use case.
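
The first step can be sketched as a plain accuracy comparison on the same held-out labels. The labels and predictions below are made-up placeholders; in practice they would come from your own data, your existing model, and the parsed LLM answers.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical held-out sample (placeholder values, not study data).
labels = ["Dementia", "Non-Dementia", "Dementia", "Non-Dementia", "Dementia"]
llm_preds = ["Dementia", "Dementia", "Dementia", "Non-Dementia", "Non-Dementia"]
current_preds = ["Dementia", "Non-Dementia", "Dementia", "Non-Dementia", "Non-Dementia"]

llm_acc = accuracy(llm_preds, labels)
current_acc = accuracy(current_preds, labels)
print(f"LLM: {llm_acc:.3f}  current model: {current_acc:.3f}  "
      f"delta: {llm_acc - current_acc:+.3f}")
```

Scoring both systems on an identical sample keeps the comparison fair. With only a handful of cases the difference is noisy, so a larger held-out set is preferable.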

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Data URLs

ADNI is public (adni.loni.usc.edu); PUMCH is private

Risks & Boundaries

Limitations

Only two datasets tested (one private, one public), limiting generality.

GPT-4 could not be fine-tuned in this study, so it cannot learn dataset-specific thresholds.

When Not To Use

As a drop-in replacement for validated clinical diagnostic models in production.

On raw tabular patient data without input standardization and error checks.

Failure Modes

False positives from over-interpreting slightly low test scores.

Missed or misapplied cutoffs, because GPT-4 may not apply numeric thresholds correctly.
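
One possible guard against both failure modes is to validate each score and apply its cutoff in code, then pass the result to the model as text instead of relying on the model to do the comparison. The sketch below assumes the caller supplies the valid range and cutoff. `ScoreSpec`, `check_score`, and the example values are hypothetical, and the cutoff shown is illustrative only, not clinical guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreSpec:
    """Valid range and screening cutoff for one test (supplied by the caller)."""
    minimum: float
    maximum: float
    cutoff: float  # scores strictly below this value are flagged


def check_score(name: str, value: float, spec: ScoreSpec) -> str:
    """Validate a score and describe it relative to its cutoff in plain text."""
    if not (spec.minimum <= value <= spec.maximum):
        raise ValueError(
            f"{name}={value} is outside [{spec.minimum}, {spec.maximum}]"
        )
    status = "below" if value < spec.cutoff else "at or above"
    return f"{name}: {value} ({status} cutoff {spec.cutoff})"


# Hypothetical spec: range 0-30 with an illustrative cutoff of 24.
spec = ScoreSpec(minimum=0, maximum=30, cutoff=24)
line = check_score("Test score", 23, spec)
print(line)
```

Out-of-range values raise an error before any prompt is built, and the threshold comparison happens deterministically instead of inside the LLM.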

Core Entities

Models

GPT-4, GPT-3.5, RRL, Logistic Regression, CART (Decision Tree), Random Forest, XGBoost

Metrics

Accuracy

Datasets

ADNI, PUMCH (private)