GPT-4 is promising for dementia screening but does not yet beat the best traditional models

Overview

Decision Snapshot: Needs Validation

The paper gives clear head-to-head accuracy numbers on public and private datasets and case studies, but scope is limited to two datasets and no code was shared.

Citations: 9

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 50%

Authors

Zhuo Wang, Rongzhen Li, Bowen Dong, Jie Wang, Xiuxing Li, Ning Liu, Chenhui Mao, Wei Zhang, Liling Dong, Jing Gao, Jianyong Wang

Links

Abstract / PDF / Data

Why It Matters For Business

LLMs like GPT-4 can speed up prototype building (no training set needed) and give readable explanations, but they don't yet match tuned clinical models; deploy cautiously and consider hybrid setups that pair an LLM with a supervised model.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors convert dementia diagnosis into multiple-choice prompts and test GPT-4 (and GPT-3.5) against five supervised models on two clinical datasets (public ADNI and private PUMCH). GPT-4 beats the older GPT-3.5 and some simple models, but it does not outperform the best supervised rule-based model (RRL). Key issues are sensitivity to input format, weak tabular handling, inability to fine-tune, and possible dataset leakage. Reported accuracies: ADNI, GPT-4 0.820 vs RRL 0.852; PUMCH-B, GPT-4 0.737 vs RRL 0.789; PUMCH-T, GPT-4 few-shot 0.632 vs RRL 0.763.

Problem Statement

Can an off-the-shelf LLM (GPT-4) replace or outperform traditional supervised AI methods for clinical dementia diagnosis, without fine-tuning, and how interpretable and faithful are its explanations?

Main Contribution

Design simple multiple-choice prompt templates that convert each patient record into a question for GPT-4.

Evaluate GPT-4 and GPT-3.5 versus five supervised models on two real clinical datasets (ADNI public, PUMCH private).
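The prompt-construction step can be sketched as follows. This is a minimal illustration, not the authors' template: the function name `build_prompt`, the feature names, the question wording, and the answer options are all hypothetical, chosen only to show how one patient record becomes one multiple-choice question.

```python
from string import ascii_uppercase


def build_prompt(record: dict, options: list[str]) -> str:
    """Render one patient record as a multiple-choice question.

    Hypothetical template: the paper's exact wording is not reproduced here.
    """
    # One "name: value" line per feature, in sorted order so the same
    # record always yields the same prompt text.
    feature_lines = [f"- {name}: {value}" for name, value in sorted(record.items())]
    # Label the answer options A, B, C, ...
    option_lines = [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(
        ["Patient record:"]
        + feature_lines
        + ["", "Question: Which diagnosis best fits this patient?"]
        + option_lines
        + ["", "Answer with a single letter."]
    )


# Illustrative record; feature names and values are made up.
example_record = {"age": 74, "education_years": 12, "memory_test_score": 21}
example_options = ["Non-Dementia", "Dementia"]
prompt = build_prompt(example_record, example_options)
print(prompt)
```

Fixing the feature order and the answer format matters here because the summary reports that results are sensitive to input format; a deterministic template at least makes that sensitivity measurable.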

Key Findings

GPT-4 does not beat the best supervised model (RRL) on evaluated datasets.

Numbers: ADNI: GPT-4 0.820 vs RRL 0.852; PUMCH-T few-shot: GPT-4 0.632 vs RRL 0.763

Practical Use: Do not replace tuned clinical models with GPT-4 today; keep supervised models like RRL for best accuracy.

Evidence Ref: Table 2

GPT-4 substantially outperforms GPT-3.5 on these tasks.

Numbers: ADNI: GPT-4 0.820 vs GPT-3.5 few-shot 0.639

Practical Use: If using an LLM, prefer GPT-4 over GPT-3.5 for clinical feature-based prediction.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 0.820 (zero-shot and few-shot)	RRL 0.852	−0.032 vs RRL	ADNI	Table 2 reports model accuracies	Table 2
Accuracy	GPT-4 0.737	RRL 0.789	−0.052 vs RRL	PUMCH-B (binary Non-Dementia vs Dementia)	Table 2 reports model accuracies	Table 2
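The deltas can be re-derived from the accuracies quoted in this summary. The short sketch below does that arithmetic; the dictionary layout and the helper name `delta_vs_baseline` are illustrative choices, and the PUMCH-T figures are the few-shot numbers given in the Summary TLDR.

```python
# Accuracies copied from this summary (attributed there to Table 2 of the paper).
results = {
    "ADNI": {"GPT-4": 0.820, "RRL": 0.852},
    "PUMCH-B": {"GPT-4": 0.737, "RRL": 0.789},
    "PUMCH-T (few-shot)": {"GPT-4": 0.632, "RRL": 0.763},
}


def delta_vs_baseline(scores: dict, model: str = "GPT-4", baseline: str = "RRL") -> float:
    """Return model accuracy minus baseline accuracy, rounded to 3 decimals."""
    return round(scores[model] - scores[baseline], 3)


deltas = {name: delta_vs_baseline(scores) for name, scores in results.items()}
for name, delta in deltas.items():
    # A negative delta means GPT-4 trails the baseline.
    print(f"{name}: {delta:+.3f}")
```

All three deltas are negative, and the gap is largest on PUMCH-T, which is consistent with the finding that GPT-4 does not beat RRL on any evaluated dataset.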

What To Try In 7 Days

Run GPT-4 prompts on a held-out sample and compare accuracy to your current model.

Build standardized input templates and automated sanity checks for numeric test scores.

Test few-shot prompting to see whether a handful of in-prompt examples improves results for your use case.
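The first two items can be prototyped with very little code. The sketch below is one possible shape, under stated assumptions: the field names, the valid ranges, and the toy labels and predictions are all made up for illustration, and `validate_scores` and `accuracy` are hypothetical helper names rather than anything from the paper.

```python
def validate_scores(record: dict, valid_ranges: dict) -> list[str]:
    """Return a list of problems found in one record's numeric fields."""
    problems = []
    for field, (low, high) in valid_ranges.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: not numeric ({value!r})")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical field names and ranges; replace with your own test battery.
VALID_RANGES = {"memory_test_score": (0, 30), "age": (18, 120)}

bad_record = {"memory_test_score": 45, "age": "seventy"}
issues = validate_scores(bad_record, VALID_RANGES)

# Toy held-out labels and two sets of predictions (made up for illustration).
labels = ["Dementia", "Non-Dementia", "Dementia", "Non-Dementia"]
llm_preds = ["Dementia", "Dementia", "Dementia", "Non-Dementia"]
current_preds = ["Dementia", "Non-Dementia", "Dementia", "Non-Dementia"]

print(issues)
print(accuracy(llm_preds, labels), accuracy(current_preds, labels))
```

Running the range check before a record reaches the prompt keeps malformed inputs out of the comparison, so any accuracy gap reflects the models rather than data-entry errors.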

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Data URLs

ADNI is public (adni.loni.usc.edu); PUMCH is private

Risks & Boundaries

Limitations

Only two datasets tested (one private, one public), limiting generality.

GPT-4 was not fine-tuned in this study, so it cannot learn dataset-specific thresholds.

When Not To Use

As a drop-in replacement for validated clinical diagnostic models in production.

On raw tabular patient data without input standardization and error checks.

Failure Modes

False positives from over-interpreting slightly low test scores.

Missed or mismatched cutoffs, because GPT-4 may not apply numeric thresholds correctly.
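One possible mitigation for both failure modes, not something this summary says the paper evaluates, is to apply cutoffs deterministically in code and pass the result to the model as text instead of relying on it to compare raw numbers. The sketch below assumes hypothetical test names and cutoff values; `annotate_with_cutoffs` is an illustrative helper, and real cutoffs would have to come from your own clinical protocol.

```python
def annotate_with_cutoffs(record: dict, cutoffs: dict) -> list[str]:
    """Describe each score relative to its cutoff, computed in code."""
    lines = []
    for field, cutoff in cutoffs.items():
        if field not in record:
            continue  # skip tests the patient did not take
        value = record[field]
        margin = value - cutoff
        status = "below cutoff" if margin < 0 else "at or above cutoff"
        # Including the margin lets a reader see how close a score is to
        # the cutoff, which is where over-interpretation tends to happen.
        lines.append(f"{field}: {value} ({status}, margin {margin:+g})")
    return lines


# Hypothetical tests and cutoff values, for illustration only.
HYPOTHETICAL_CUTOFFS = {"memory_test_score": 24, "naming_test_score": 20}

record = {"memory_test_score": 23, "naming_test_score": 26}
annotated = annotate_with_cutoffs(record, HYPOTHETICAL_CUTOFFS)
print("\n".join(annotated))
```

Whether such annotations actually reduce false positives would need to be checked on held-out data; the sketch only moves the threshold comparison out of the model.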

Core Entities

Models

GPT-4, GPT-3.5, RRL, Logistic Regression, CART (Decision Tree), Random Forest, XGBoost

Metrics

Accuracy

Datasets

ADNI, PUMCH (private)
