Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
Automated agent pipelines can cut human prompt-tuning time and reach near-expert accuracy on clinical-note screening, lowering labor cost and speeding deployment in health systems.
Summary TLDR
The authors built a fully automated multi-agent workflow that uses LLaMA 3 8B to screen clinical notes for cognitive concerns. On 3,338 notes from 200 patients, the agentic workflow reached an F1 of 0.91 on the prompt-refinement set with perfect specificity (1.00) after two iterations, matching an expert-driven benchmark (F1 0.90). On an independent validation set, both workflows lost sensitivity (AP2 F1 0.76 vs XP4 F1 0.79). The approach cuts human prompt-tuning steps but can produce non-binary or irrelevant replies and is limited to unstructured notes.
Problem Statement
Early cognitive decline is subtle and under-documented in clinical notes. Manual prompt tuning for LLM screening is time-consuming and resource intensive. The paper aims to automate prompt refinement with specialized agents using LLaMA 3 8B to reach expert-level screening accuracy faster and at lower human cost.
Main Contribution
Designed a six-agent automated workflow that iteratively refines prompts and aggregates LLM outputs to label patients for cognitive concerns.
Implemented the workflow with LLaMA 3 8B on 3,338 clinical notes from 200 patients and compared it to an expert-driven prompt-refinement benchmark.
Showed comparable classification performance to expert prompts while requiring fewer iterations (2 vs 4) and achieving perfect specificity on the refinement set.
Reported generalizability limits: both workflows lost sensitivity on an independent validation set and produced non-binary/unexpected outputs in some cases.
Key Findings
Agentic prompt AP2 reached F1-score 0.91 on the prompt-refinement dataset.
AP2 achieved perfect specificity (1.00) and PPV (1.00) on the prompt-refinement set after two iterations.
The final expert prompt XP4 performed nearly as well on the refinement set (F1 0.90) and better than AP2 on the validation set (F1 0.79 vs 0.76).
The agentic workflow needed 2 iterations to reach the stopping criteria; the clinician-driven workflow needed 4.
Both workflows lost sensitivity on the validation set; AP2 sensitivity fell to 0.61 and XP4 to 0.70.
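The refinement-set figures above can be cross-checked with the standard F1 identity, F1 = 2·PPV·sensitivity / (PPV + sensitivity). The sketch below solves that identity for sensitivity. The function name `implied_sensitivity` is hypothetical, and the result is only an estimate derived from the rounded F1 and PPV quoted here, not a value reported in this summary.

```python
def implied_sensitivity(f1: float, ppv: float) -> float:
    """Solve F1 = 2*PPV*S / (PPV + S) for sensitivity S.

    Rearranging gives S = F1*PPV / (2*PPV - F1). The inputs are rounded
    summary figures, so the output is approximate.
    """
    denominator = 2 * ppv - f1
    if denominator <= 0:
        raise ValueError("F1 and PPV are inconsistent (need F1 < 2*PPV)")
    return f1 * ppv / denominator


if __name__ == "__main__":
    # AP2 on the prompt-refinement set: F1 0.91, PPV 1.00 (from Key Findings).
    print(round(implied_sensitivity(0.91, 1.00), 2))
```

Under these rounded inputs, AP2's refinement-set sensitivity works out to roughly 0.83, which suggests the perfect specificity came with some missed positives even before validation.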
Results
Prompt-refinement F1-score (agentic AP2): 0.91
Prompt-refinement specificity (agentic AP2): 1.00
Prompt-refinement F1-score (expert XP4): 0.90
Validation F1-score (expert XP4): 0.79
Validation F1-score (agentic AP2): 0.76
Iterations to stop: 2 (agentic) vs 4 (expert)
Who Should Care
What To Try In 7 Days
Run LLaMA 3 8B locally on a small note sample and establish a baseline with prompt P0 ('Is this note indicative...').
Implement 2–3 agent roles: specialist (labeler), evaluator (metrics), and one improver (specificity or sensitivity).
Use the SOP checklist from the paper (keywords, meds, tests) to seed prompt AP2 and measure F1, sensitivity, specificity.
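For the evaluator role in the steps above, a minimal sketch of the metric computation might look like the following. It assumes patient-level labels are available as booleans; the names `Metrics` and `evaluate` and the toy labels are illustrative only and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    f1: float
    accuracy: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a class is absent from the sample.
    return numerator / denominator if denominator else float("nan")


def evaluate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Metrics:
    """Compute binary screening metrics from reference and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
        accuracy=_ratio(tp + tn, len(pairs)),
    )


if __name__ == "__main__":
    # Toy labels for illustration; not data from the study.
    truth = [True, True, True, False, False]
    preds = [True, True, False, False, False]
    print(evaluate(truth, preds))
```

Returning NaN for an undefined ratio keeps a small pilot sample from crashing the evaluator when, for example, no positives are predicted.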
Agent Features
Memory
- Short-term: aggregation of specialist outputs per patient
Planning
- Iterative prompt refinement with threshold-based stopping
- Decision rule: label patient 'positive' if any note is positive
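The two planning items above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `label_patients` and `refine_until_threshold` are hypothetical names, `score_fn` and `improve_fn` stand in for the evaluator and improver agents, and the threshold value and the metric it applies to are placeholders, since this summary does not state them.

```python
from typing import Callable, Dict, List, Tuple


def label_patients(note_labels: Dict[str, List[bool]]) -> Dict[str, bool]:
    """Patient is positive if any of their notes was labeled positive.

    A patient with no notes is labeled negative here (any([]) is False);
    that choice is an assumption of this sketch.
    """
    return {pid: any(labels) for pid, labels in note_labels.items()}


def refine_until_threshold(
    prompt: str,
    score_fn: Callable[[str], float],
    improve_fn: Callable[[str], str],
    threshold: float,
    max_iters: int,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Score a prompt, stop once it meets the threshold, otherwise revise it."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    history: List[Tuple[str, float]] = []
    for _ in range(max_iters):
        score = score_fn(prompt)
        history.append((prompt, score))
        if score >= threshold:
            break
        prompt = improve_fn(prompt)
    # The returned prompt is always the last one that was actually scored.
    return history[-1][0], history


if __name__ == "__main__":
    print(label_patients({"p1": [False, True], "p2": [False, False]}))
    # Stub agents: score grows with prompt length, improver appends a character.
    final, steps = refine_until_threshold(
        "abc", lambda p: len(p) / 10, lambda p: p + "x", threshold=0.5, max_iters=5
    )
    print(final, len(steps))
```

The any-positive rule favors sensitivity at the patient level, so a single false-positive note is enough to flag a patient; that trade-off is worth checking against local data.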
Tool Use
- ChatGPT-4o used for some prompt refinement steps
- Hugging Face weights and local inference stack
Frameworks
- Generated knowledge prompting
- Standard Operating Procedure (SOP) for cognitive concern signals
Is Agentic
true
Architectures
- Single LLM (LLaMA 3 8B) with multi-agent orchestration
Collaboration
- Specialized agents exchange errors and prompt edits (improvers and summarizers)
Optimization Features
Token Efficiency
- Kept temperature low (0.1) and capped output tokens at 256
System Optimization
- Deployed on server with 48 cores and 256 GB RAM
Inference Optimization
- Used LLaMA 3 8B for local, lower-resource inference
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small clinical cohort (200 patients) limits representativeness.
- Uses only unstructured notes; missing structured data and multimodal signals.
- LLM sometimes produced non-binary or irrelevant outputs that needed exclusion.
- Validation performance dropped, indicating risk of overfitting to refinement set.
- No public code or public EHR data provided for exact replication.
When Not To Use
- When you need guaranteed high sensitivity across diverse sites without local validation.
- When structured EHR fields or multimodal data are required for diagnosis.
- In regulatory settings where black-box LLM outputs and non-binary replies are unacceptable.
Failure Modes
- Non-conforming outputs (not strict yes/no) leading to 'uncertain' exclusions.
- Overfitting prompt edits to the refinement set, reducing sensitivity on new data.
- False negatives when relevant notes are sparse or patient-level aggregation obscures note-level signals.
- Model reliance on risk factors or screening results as proxy evidence without symptom documentation.
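The first failure mode above, replies that are not a strict yes/no, is usually handled with a parsing step before aggregation. The sketch below is one possible approach and is an assumption: the summary does not describe how the authors parsed replies, and `parse_reply` and `split_by_label` are hypothetical names.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Accept only replies that begin with a standalone "yes" or "no".
_LEADING_ANSWER = re.compile(r"\s*(yes|no)\b", flags=re.IGNORECASE)


def parse_reply(reply: str) -> str:
    """Map a raw model reply to 'yes', 'no', or 'uncertain'."""
    match = _LEADING_ANSWER.match(reply)
    return match.group(1).lower() if match else "uncertain"


def split_by_label(replies: Iterable[str]) -> Dict[str, List[str]]:
    """Group raw replies by parsed label so exclusions can be audited."""
    groups: Dict[str, List[str]] = {"yes": [], "no": [], "uncertain": []}
    for reply in replies:
        groups[parse_reply(reply)].append(reply)
    return groups


if __name__ == "__main__":
    sample = [
        "Yes, the note mentions memory loss.",
        "No.",
        "Not enough information to decide.",
        "The note discusses a knee injury.",
    ]
    print(Counter(parse_reply(r) for r in sample))
```

Keeping the raw 'uncertain' replies, rather than silently dropping them, makes it possible to report how many notes were excluded and to feed them back into prompt refinement.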
Core Entities
Models
- LLaMA 3 8B
Metrics
- sensitivity
- specificity
- F1-score
- PPV
- NPV
- Accuracy
Datasets
- Mass General Brigham clinical notes (3,338 notes; 200 patients; 2016-2018)
Benchmarks
- Expert-driven prompt-refinement workflow (XP1..XP4)

