Overview
Production Readiness
0.3
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
9
Why It Matters For Business
LLMs like GPT-4 can speed up prototyping (no training set is needed) and give readable explanations, but they do not yet match tuned clinical models. Deploy them cautiously and prefer hybrid setups that pair an LLM with a supervised model.
Summary TLDR
The authors convert dementia diagnosis into multiple-choice prompts and test GPT-4 (and GPT-3.5) against five supervised models on two clinical datasets (public ADNI and private PUMCH). GPT-4 beats the older GPT-3.5 and some simple models, but it does not outperform the best supervised rule-based model (RRL). Key issues are sensitivity to input format, weak tabular handling, inability to fine-tune, and possible dataset leakage. Accuracy: ADNI, GPT-4 0.820 vs RRL 0.852; PUMCH-B, GPT-4 0.737 vs RRL 0.789; PUMCH-T, GPT-4 few-shot 0.632 vs RRL 0.763.
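To make the size of the gap concrete, the short sketch below recomputes the accuracy difference between RRL and GPT-4 from the figures quoted in the summary. The dictionary layout and the helper name `gap_to_rrl` are illustrative choices, not part of the paper.

```python
# Accuracy figures copied from the summary above.
results = {
    "ADNI": {"GPT-4": 0.820, "RRL": 0.852},
    "PUMCH-B": {"GPT-4": 0.737, "RRL": 0.789},
    "PUMCH-T": {"GPT-4 (few-shot)": 0.632, "RRL": 0.763},
}


def gap_to_rrl(scores):
    """Return RRL accuracy minus the GPT-4 accuracy, rounded to 3 decimals."""
    rrl = scores["RRL"]
    gpt = next(value for name, value in scores.items() if name != "RRL")
    return round(rrl - gpt, 3)


gaps = {dataset: gap_to_rrl(scores) for dataset, scores in results.items()}
for dataset, gap in gaps.items():
    print(f"{dataset}: RRL leads by {gap:.3f}")
```

The gap is smallest on ADNI, the public dataset, and largest on PUMCH-T.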
Problem Statement
Can an off-the-shelf LLM (GPT-4) replace or outperform traditional supervised AI methods for clinical dementia diagnosis, without fine-tuning, and how interpretable and faithful are its explanations?
Main Contribution
Design simple multiple-choice prompt templates that convert each patient record into a question for GPT-4.
Evaluate GPT-4 and GPT-3.5 versus five supervised models on two real clinical datasets (ADNI public, PUMCH private).
Qualitatively compare GPT-4 explanations to doctors and list current limitations and future directions.
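The first contribution, turning a patient record into a multiple-choice question, can be sketched roughly as follows. This is a minimal illustration, not the authors' template: the wording of the prompt, the feature names, and the diagnosis labels are all assumptions.

```python
from typing import Mapping, Optional, Sequence


def build_mc_prompt(record: Mapping[str, object], options: Sequence[str]) -> str:
    """Render one patient record as a multiple-choice question (hypothetical wording)."""
    feature_lines = [f"- {name}: {value}" for name, value in record.items()]
    option_lines = [f"{chr(ord('A') + i)}. {label}" for i, label in enumerate(options)]
    return "\n".join(
        ["Patient record:"]
        + feature_lines
        + ["", "Question: Which diagnosis best fits this patient?"]
        + option_lines
        + ["", "Answer with a single letter."]
    )


def parse_choice(reply: str, options: Sequence[str]) -> Optional[str]:
    """Map a reply such as 'B' or 'b. Dementia' back to its label, or None if invalid."""
    letter = reply.strip()[:1].upper()
    index = ord(letter) - ord("A") if letter else -1
    return options[index] if 0 <= index < len(options) else None


# Hypothetical record and labels, for illustration only.
options = ["Cognitively normal", "Dementia"]
prompt = build_mc_prompt({"Age": 72, "MMSE": 24}, options)
print(prompt)
```

Parsing the reply back to a fixed label set keeps the LLM output comparable with the supervised baselines, which predict one class per record.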
Key Findings
GPT-4 does not beat the best supervised model (RRL) on either evaluated dataset.
GPT-4 substantially outperforms GPT-3.5 on these tasks.
Few-shot prompting can improve performance over zero-shot in some cases.
GPT-4's outputs and explanations are sensitive to input quality and prompt format.
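The difference between zero-shot and few-shot prompting comes down to whether solved examples are placed before the query. The sketch below shows one plausible way to assemble such a prompt; the `Answer:` convention and the function name are assumptions rather than the paper's format.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], query: str) -> str:
    """Prepend solved (question, answer) pairs to the query.

    With an empty `examples` sequence this reduces to a zero-shot prompt.
    """
    blocks = [f"{question}\nAnswer: {answer}" for question, answer in examples]
    blocks.append(f"{query}\nAnswer:")
    return "\n\n".join(blocks)


zero_shot = build_few_shot_prompt([], "Q2")
one_shot = build_few_shot_prompt([("Q1", "A")], "Q2")
print(one_shot)
```

Because the findings above report sensitivity to prompt format, it is worth keeping the example layout identical to the query layout when comparing the two settings.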
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run GPT-4 prompts on a held-out sample and compare accuracy to your current model.
Build standardized input templates and automated sanity checks for numeric test scores.
Test few-shot examples to see if short prompting improves your use case.
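The first two steps can be prototyped in a few lines: range checks on numeric scores before a record is sent to the model, and a plain accuracy function for comparing predictions on a held-out sample. The score names and ranges below are placeholders chosen for illustration; substitute the instruments and ranges used in your own data.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

# Hypothetical score ranges for illustration; use the ranges of your own instruments.
VALID_RANGES: Dict[str, Tuple[float, float]] = {"MMSE": (0, 30), "MoCA": (0, 30)}


def check_scores(
    record: Mapping[str, object],
    ranges: Mapping[str, Tuple[float, float]] = VALID_RANGES,
) -> List[str]:
    """Return a list of problems found in the numeric scores of one record."""
    problems: List[str] = []
    for name, (low, high) in ranges.items():
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: not numeric ({value!r})")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


print(check_scores({"MMSE": 31, "MoCA": 22}))
print(accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
```

Running the same `accuracy` function over both the LLM's parsed answers and your current model's predictions gives a like-for-like comparison on the same held-out records.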
Reproducibility
Data Urls
- ADNI is public (adni.loni.usc.edu); PUMCH is private
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only two datasets tested (one private, one public), limiting generality.
- GPT-4 was not fine-tuned in this study, so it cannot learn dataset-specific thresholds.
- The LLM is sensitive to prompt wording and input quality; tabular handling is weak.
- Explanations may list reasons without faithfully reflecting how much each one contributed to the decision.
When Not To Use
- As a drop-in replacement for validated clinical diagnostic models in production.
- On raw tabular patient data without input standardization and error checks.
- For high-stakes automated diagnosis without human oversight.
Failure Modes
- False positives from over-interpreting slightly low test scores.
- Missed or misapplied cutoffs, because GPT-4 may not apply numeric thresholds correctly.
- Inconsistent explanations that do not match final decision.
- Performance inflation on public datasets due to potential training-data leakage.
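One way to guard against the threshold-related failure modes is to apply cutoffs deterministically in code and route disagreements with the LLM to a human reviewer. The sketch below is an assumption-laden illustration of that hybrid idea: the cutoff value, the score name, and both helper functions are hypothetical and are not taken from the paper.

```python
from typing import Dict, Mapping


def below_cutoff_flags(
    record: Mapping[str, float], cutoffs: Mapping[str, float]
) -> Dict[str, bool]:
    """For each score present in the record, flag whether it falls below its cutoff."""
    return {
        name: record[name] < cutoff
        for name, cutoff in cutoffs.items()
        if name in record
    }


def needs_review(llm_says_impaired: bool, flags: Mapping[str, bool]) -> bool:
    """Send the case to a human when the LLM and the rule-based check disagree."""
    rule_says_impaired = any(flags.values())
    return llm_says_impaired != rule_says_impaired


# Hypothetical cutoff: a score of 25 is slightly low but not below the cutoff of 24.
flags = below_cutoff_flags({"MMSE": 25}, {"MMSE": 24})
print(flags, needs_review(llm_says_impaired=True, flags=flags))
```

In this example the LLM labels the patient impaired while the rule does not, which matches the over-interpretation failure mode listed above, so the case is flagged for review.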
Core Entities
Models
- GPT-4
- GPT-3.5
- RRL
- Logistic Regression
- CART (Decision Tree)
- Random Forest
- XGBoost
Metrics
- Accuracy
Datasets
- ADNI
- PUMCH (private)

