Overview
The paper reports clear head-to-head accuracy numbers on one public and one private dataset, plus case studies, but its scope is limited to those two datasets and no code was shared.
Citations: 9
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 50%
Why It Matters For Business
LLMs like GPT-4 can speed up prototyping (no training set is needed) and give readable explanations, but they do not yet match tuned clinical models; deploy them cautiously and consider hybrid setups.
Summary TLDR
The authors recast dementia diagnosis as multiple-choice prompts and test GPT-4 (and GPT-3.5) against five supervised models on two clinical datasets (public ADNI and private PUMCH). GPT-4 beats the older GPT-3.5 and some simple models, but it does not outperform the best supervised model, the rule-based RRL. Key issues are sensitivity to input format, weak handling of tabular data, the inability to fine-tune, and possible dataset leakage. Accuracy: ADNI, GPT-4 0.820 vs RRL 0.852; PUMCH-B, GPT-4 0.737 vs RRL 0.789; PUMCH-T, GPT-4 few-shot 0.632 vs RRL 0.763.
Problem Statement
Can an off-the-shelf LLM (GPT-4) replace or outperform traditional supervised AI methods for clinical dementia diagnosis, without fine-tuning, and how interpretable and faithful are its explanations?
Main Contribution
Design simple multiple-choice prompt templates that convert each patient record into a question for GPT-4.
Evaluate GPT-4 and GPT-3.5 versus five supervised models on two real clinical datasets (ADNI public, PUMCH private).
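The prompt-template idea can be sketched as follows. This is a minimal illustration, not the authors' template: the function name `build_prompt`, the wording of the question, and the example field names and values are all assumptions made for this sketch.

```python
from typing import Mapping, Sequence


def build_prompt(record: Mapping[str, object], options: Sequence[str]) -> str:
    """Render one patient record as a multiple-choice question (illustrative)."""
    # One "name: value" line per feature, in sorted order so the same
    # record always produces the same prompt text.
    feature_lines = [f"- {name}: {record[name]}" for name in sorted(record)]
    # Label the answer options A, B, C, ...
    option_lines = [
        f"{chr(ord('A') + i)}. {label}" for i, label in enumerate(options)
    ]
    return "\n".join(
        [
            "Patient record:",
            *feature_lines,
            "",
            "Which diagnosis best fits this patient?",
            *option_lines,
            "Answer with a single letter.",
        ]
    )


# Hypothetical record: field names and values are made up for illustration.
example = {"age": 74, "MMSE": 22, "education_years": 9}
prompt = build_prompt(example, ["Non-Dementia", "Dementia"])
print(prompt)
```

Keeping the feature order and phrasing fixed matters here because the summary notes that GPT-4 is sensitive to input format; a deterministic template removes one source of variation between runs.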
Key Findings
GPT-4 does not beat the best supervised model (RRL) on any of the evaluated datasets.
GPT-4 substantially outperforms GPT-3.5 on these tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | ADNI: GPT-4 0.820 (zero-shot; few-shot also 0.820) | RRL 0.852 | −0.032 vs RRL | ADNI | Table 2 reports model accuracies | Table 2 |
| Accuracy | PUMCH-B: GPT-4 0.737 | RRL 0.789 | −0.052 vs RRL | PUMCH-B (binary Non-Dementia vs Dementia) | Table 2 reports model accuracies | Table 2 |
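The Delta column is simply the GPT-4 accuracy minus the RRL accuracy. A short check of that arithmetic, using only the accuracies quoted in this summary (the PUMCH-T pair comes from the Summary TLDR, not from the table), might look like this. The names `reported` and `delta` are chosen for this sketch.

```python
# Accuracies quoted in this summary, as (GPT-4, RRL) pairs.
reported = {
    "ADNI": (0.820, 0.852),
    "PUMCH-B": (0.737, 0.789),
    "PUMCH-T (GPT-4 few-shot)": (0.632, 0.763),
}


def delta(model_acc: float, baseline_acc: float) -> float:
    """Model accuracy minus baseline accuracy, rounded to three decimals."""
    return round(model_acc - baseline_acc, 3)


deltas = {name: delta(gpt4, rrl) for name, (gpt4, rrl) in reported.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.3f} vs RRL")
```

A negative value means GPT-4 trails RRL on that dataset.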
What To Try In 7 Days
Run GPT-4 prompts on a held-out sample and compare accuracy to your current model.
Build standardized input templates and automated sanity checks for numeric test scores.
Test few-shot prompting to see whether adding a few in-prompt examples improves accuracy for your use case.
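The first two steps above can be prototyped in a few lines. The sketch below assumes records are plain dictionaries and that the cognitive scores in use are MMSE and MoCA (both scored from 0 to 30); the source does not name the fields in either dataset, so `VALID_RANGES`, `check_scores`, and `accuracy` are illustrative names, and the sample predictions are invented.

```python
# Assumed score fields and their valid ranges (MMSE and MoCA both run 0-30).
# Replace with the fields your own records actually contain.
VALID_RANGES = {"MMSE": (0, 30), "MoCA": (0, 30)}


def check_scores(record: dict) -> list:
    """Return a list of problems found in the numeric test scores."""
    problems = []
    for name, (low, high) in VALID_RANGES.items():
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: not numeric ({value!r})")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


def accuracy(predictions: list, labels: list) -> float:
    """Fraction of predictions that match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Invented held-out sample: D = Dementia, N = Non-Dementia.
labels = ["D", "N", "D", "N"]
llm_predictions = ["D", "N", "N", "N"]
current_model_predictions = ["D", "N", "D", "N"]

print(check_scores({"MMSE": 31, "MoCA": "n/a"}))
print("LLM accuracy:", accuracy(llm_predictions, labels))
print("Current model accuracy:", accuracy(current_model_predictions, labels))
```

Running the sanity check before building the prompt keeps malformed scores from reaching the model, where they would be hard to tell apart from genuine prediction errors.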
Risks & Boundaries
Limitations
Only two datasets tested (one private, one public), limiting generality.
GPT-4 could not be fine-tuned in this study, so it cannot learn dataset-specific thresholds.
When Not To Use
As a drop-in replacement for validated clinical diagnostic models in production.
On raw tabular patient data without input standardization and error checks.
Failure Modes
False positives from over-interpreting slightly low test scores.
Missed or mismatched cutoffs, because GPT-4 may not apply numeric thresholds correctly.
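One way to catch both failure modes is to cross-check each LLM answer against a plain cutoff rule and send disagreements, or cases close to the cutoff, to a human reviewer. The sketch below is an assumption about how such a check could look, not something the paper describes; `needs_review`, the `margin` parameter, and the cutoff value in the example are placeholders, and the cutoff in particular carries no clinical meaning.

```python
def needs_review(
    llm_label: str, score: float, cutoff: float, margin: float = 0.0
) -> bool:
    """Flag a case when the LLM disagrees with a simple cutoff rule,
    or when the score sits within `margin` of the cutoff."""
    # Simple rule: scores below the cutoff map to "Dementia".
    rule_label = "Dementia" if score < cutoff else "Non-Dementia"
    near_cutoff = abs(score - cutoff) <= margin
    return llm_label != rule_label or near_cutoff


# Placeholder cutoff for illustration only; use your validated threshold.
CUTOFF = 24

# LLM says Dementia although the score is above the cutoff: flag it.
print(needs_review("Dementia", score=25, cutoff=CUTOFF, margin=1))
# LLM and rule agree and the score is far from the cutoff: no flag.
print(needs_review("Non-Dementia", score=29, cutoff=CUTOFF, margin=1))
```

The rule is deliberately crude: its purpose is to surface suspicious answers for review, not to replace either the LLM or a validated diagnostic model.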

