Multi-agent LLaMA 3 workflow matches expert prompts for detecting cognitive concerns in clinical notes

February 3, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

4

Authors

Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri

Why It Matters For Business

Automated agent pipelines can cut human prompt-tuning time and reach near-expert accuracy on clinical-note screening, lowering labor cost and speeding deployment in health systems.

Summary TLDR

The authors built a fully automated multi-agent workflow that uses LLaMA 3 8B to screen clinical notes for cognitive concerns. On 3,338 notes from 200 patients, the agentic workflow reached an F1 of 0.91 on the prompt-refinement set with perfect specificity (1.00) after two iterations, matching an expert-driven benchmark (F1 0.90). On an independent validation set, both workflows lost sensitivity (AP2 F1 0.76 vs XP4 F1 0.79). The approach cuts human prompt-tuning steps, but it can output non-binary or irrelevant replies and is limited to unstructured notes.

Problem Statement

Early cognitive decline is subtle and under-documented in clinical notes. Manually tuning prompts for LLM screening is time-consuming and resource-intensive. The paper aims to automate prompt refinement with specialized agents built on LLaMA 3 8B, reaching expert-level screening accuracy faster and at lower human cost.

Main Contribution

Designed a six-agent automated workflow that iteratively refines prompts and aggregates LLM outputs to label patients for cognitive concerns.

Implemented the workflow with LLaMA 3 8B on 3,338 clinical notes from 200 patients and compared it to an expert-driven prompt-refinement benchmark.

Showed comparable classification performance to expert prompts while requiring fewer iterations (2 vs 4) and achieving perfect specificity on the refinement set.

Reported generalizability limits: both workflows lost sensitivity on an independent validation set and produced non-binary/unexpected outputs in some cases.

Key Findings

Agentic prompt AP2 reached F1-score 0.91 on the prompt-refinement dataset.

Numbers: F1 = 0.91 (Table 3)

AP2 achieved perfect specificity (1.00) and PPV (1.00) on the prompt-refinement set after two iterations.

Numbers: Specificity = 1.00; PPV = 1.00 (Table 3)
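Because F1 is the harmonic mean of PPV and sensitivity, the two reported figures pin down the sensitivity they imply. The sketch below only rearranges that identity; the function name is illustrative, and the inputs are the rounded values quoted above, so the result is approximate rather than a number taken from the paper.

```python
def sensitivity_from_f1(f1: float, ppv: float) -> float:
    """Solve F1 = 2 * PPV * Sens / (PPV + Sens) for Sens."""
    return f1 * ppv / (2 * ppv - f1)


# Rounded AP2 values on the prompt-refinement set, as quoted above.
implied_sensitivity = sensitivity_from_f1(f1=0.91, ppv=1.00)
print(round(implied_sensitivity, 2))  # 0.83
```

Read this way, the perfect specificity and PPV come with a sensitivity of roughly 0.83 on the refinement set, subject to rounding in the inputs.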

The final expert prompt, XP4, performed almost equally well on the refinement set (F1 0.90) and better than AP2 on the validation set (F1 0.79 vs 0.76).

Numbers: Refinement: XP4 F1 = 0.90 (eTable2); Validation: XP4 F1 = 0.79, AP2 F1 = 0.76 (eTable4)

The agentic workflow needed 2 iterations to reach the stopping criteria; the expert-driven workflow needed 4.

Numbers: Iterations: agentic = 2; expert-driven = 4 (Results & Discussion)

Both workflows lost sensitivity on the validation set; AP2 sensitivity fell to 0.61 and XP4 to 0.70.

Numbers: Validation sensitivity: AP2 = 0.61; XP4 = 0.70 (eTable4)

Results

Prompt-refinement F1-score (agentic AP2)

Value: 0.91

Baseline: P0 F1 = 0.70

Prompt-refinement specificity (agentic AP2)

Value: 1.00

Baseline: P0 specificity = 0.20

Prompt-refinement F1-score (expert XP4)

Value: 0.90

Baseline: P0 F1 = 0.70

Validation F1-score (expert XP4)

Value: 0.79

Baseline: XP4 refinement F1 = 0.90

Validation F1-score (agentic AP2)

Value: 0.76

Baseline: AP2 refinement F1 = 0.91

Iterations to stop

Value: Agentic: 2; Expert-driven: 4

Who Should Care

What To Try In 7 Days

Run LLaMA 3 8B locally on a small note sample and establish a baseline with the initial prompt P0 ('Is this note indicative...').

Implement 2–3 agent roles: specialist (labeler), evaluator (metrics), and one improver (specificity or sensitivity).

Use the SOP checklist from the paper (keywords, meds, tests) to seed prompt AP2 and measure F1, sensitivity, specificity.
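The steps above amount to a small evaluate-and-improve loop. The following is a minimal sketch of that loop, not the authors' implementation: `refine_prompt`, `evaluate`, `improve`, and the threshold values are all hypothetical, and the two stub callables stand in for the real specialist/evaluator and improver agents, which would call the LLM.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


def refine_prompt(
    initial_prompt: str,
    evaluate: Callable[[str], Metrics],
    improve: Callable[[str, Metrics], str],
    thresholds: Metrics,
    max_iterations: int = 5,
) -> Tuple[str, List[Metrics]]:
    """Improve a prompt until every metric meets its threshold or the budget runs out."""
    prompt = initial_prompt
    metrics = evaluate(prompt)
    history = [metrics]
    iterations = 0
    while iterations < max_iterations and any(
        metrics[name] < floor for name, floor in thresholds.items()
    ):
        prompt = improve(prompt, metrics)  # improver agent edits the prompt
        metrics = evaluate(prompt)         # specialist labels, evaluator scores
        history.append(metrics)
        iterations += 1
    return prompt, history


# Toy stand-ins with made-up scores, only to show the control flow.
TOY_SCORES = [
    {"sensitivity": 0.95, "specificity": 0.30},
    {"sensitivity": 0.90, "specificity": 0.70},
    {"sensitivity": 0.85, "specificity": 0.95},
]


def toy_evaluate(prompt: str) -> Metrics:
    return TOY_SCORES[min(prompt.count("+"), len(TOY_SCORES) - 1)]


def toy_improve(prompt: str, metrics: Metrics) -> str:
    return prompt + "+"


final_prompt, history = refine_prompt(
    "P0", toy_evaluate, toy_improve,
    thresholds={"sensitivity": 0.80, "specificity": 0.90},
)
print(final_prompt, len(history) - 1)  # P0++ 2
```

Keeping the full `history` makes it easy to see which edit moved which metric, which helps when deciding whether an improver targeting specificity is costing too much sensitivity.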

Agent Features

Memory

  • Short-term: aggregation of specialist outputs per patient

Planning

  • Iterative prompt refinement with threshold-based stopping
  • Decision rule: label patient 'positive' if any note is positive
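The any-positive decision rule can be sketched in a few lines. This is an illustration under assumptions: the function name, the `(patient_id, label)` input shape, and the choice to skip labels other than 'yes' or 'no' are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def label_patients(note_labels: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Aggregate note-level 'yes'/'no' labels into one label per patient.

    A patient is 'positive' if any of their notes is 'yes'. Labels other
    than 'yes' or 'no' are skipped (an assumption of this sketch), so a
    patient whose notes are all skipped does not appear in the result.
    """
    patients: Dict[str, str] = {}
    for patient_id, label in note_labels:
        if label == "yes":
            patients[patient_id] = "positive"
        elif label == "no":
            patients.setdefault(patient_id, "negative")
    return patients


notes = [
    ("p1", "no"), ("p1", "yes"), ("p1", "no"),
    ("p2", "no"), ("p2", "uncertain"),
    ("p3", "uncertain"),
]
print(label_patients(notes))  # {'p1': 'positive', 'p2': 'negative'}
```

An any-positive rule favors sensitivity at the patient level: a single false-positive note is enough to flag a patient, so note-level specificity matters more as the number of notes per patient grows.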

Tool Use

  • ChatGPT-4o used for some prompt refinement steps
  • Hugging Face weights and local inference stack

Frameworks

  • Generated knowledge prompting
  • Standard Operating Procedure (SOP) for cognitive concern signals

Is Agentic

true

Architectures

  • Single LLM (LLaMA 3 8B) with multi-agent orchestration

Collaboration

  • Specialized agents exchange errors and prompt edits (improvers and summarizers)

Optimization Features

Token Efficiency

  • Kept temperature low (0.1) and capped output tokens at 256

System Optimization

  • Deployed on server with 48 cores and 256 GB RAM

Inference Optimization

  • Used LLaMA 3 8B for local, lower-resource inference

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small clinical cohort (200 patients) limits representativeness.
  • Uses only unstructured notes; missing structured data and multimodal signals.
  • LLM sometimes produced non-binary or irrelevant outputs that needed exclusion.
  • Validation performance dropped, indicating risk of overfitting to refinement set.
  • No public code or public EHR data provided for exact replication.

When Not To Use

  • When you need guaranteed high sensitivity across diverse sites without local validation.
  • When structured EHR fields or multimodal data are required for diagnosis.
  • In regulatory settings where black-box LLM outputs and non-binary replies are unacceptable.

Failure Modes

  • Non-conforming outputs (not strict yes/no) leading to 'uncertain' exclusions.
  • Overfitting prompt edits to the refinement set, reducing sensitivity on new data.
  • False negatives when relevant notes are sparse or patient-level aggregation obscures note-level signals.
  • Model reliance on risk factors or screening results as proxy evidence without symptom documentation.
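The first failure mode is usually handled at parse time. The sketch below shows one way to map free-text replies onto 'yes', 'no', or 'uncertain'; the function name and the matching rule (reply must start with the word yes or no) are assumptions for illustration, not the paper's parsing logic.

```python
import re

# Accept a reply only if it starts with the whole word "yes" or "no",
# optionally preceded by punctuation or markdown such as "**".
_ANSWER = re.compile(r"^\W*(yes|no)\b", re.IGNORECASE)


def parse_reply(reply: str) -> str:
    """Return 'yes', 'no', or 'uncertain' for a raw model reply."""
    match = _ANSWER.match(reply.strip())
    return match.group(1).lower() if match else "uncertain"


print(parse_reply("Yes."))                         # yes
print(parse_reply("No, the note is routine."))     # no
print(parse_reply("The patient may have issues"))  # uncertain
```

Counting how many replies fall into 'uncertain' per prompt version is a cheap signal that a prompt edit has made the output format less stable.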

Core Entities

Models

  • LLaMA 3 8B

Metrics

  • sensitivity
  • specificity
  • F1-score
  • PPV
  • NPV
  • Accuracy
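All six metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the toy counts are illustrative and are not results from the paper.

```python
from typing import Dict


def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary screening metrics from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall on true positives
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy counts, chosen only to exercise the formulas.
toy = screening_metrics(tp=8, fp=2, tn=9, fn=1)
print({name: round(value, 2) for name, value in toy.items()})
```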

Datasets

  • Mass General Brigham clinical notes (3,338 notes; 200 patients; 2016-2018)

Benchmarks

  • Expert-driven prompt-refinement workflow (XP1..XP4)