Multi-agent LLaMA 3 workflow matches expert prompts for detecting cognitive concerns in clinical notes

February 3, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The workflow shows practical gains in iteration reduction and specificity on the studied set, but sensitivity and generalizability decline on held-out data; test locally and monitor uncertain and false-negative rates before production.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri

Why It Matters For Business

Automated agent pipelines can cut human prompt-tuning time and reach near-expert accuracy on clinical-note screening, lowering labor cost and speeding deployment in health systems.

Summary TLDR

The authors built a fully automated multi-agent workflow that uses LLaMA 3 8B to screen clinical notes for cognitive concerns. Using 3,338 notes from 200 patients, the agentic workflow reached an F1 of 0.91 on the prompt-refinement set with perfect specificity (1.00) after two iterations, matching the expert-driven benchmark (F1 0.90). On an independent validation set, both workflows showed a drop in sensitivity (AP2 F1 0.76 vs. XP4 F1 0.79). The approach cuts human prompt-tuning steps but can output non-binary or irrelevant replies and is limited to unstructured notes.

Problem Statement

Early cognitive decline is subtle and under-documented in clinical notes. Manual prompt tuning for LLM screening is time-consuming and resource-intensive. The paper aims to automate prompt refinement with specialized agents using LLaMA 3 8B, reaching expert-level screening accuracy faster and at lower human cost.

Main Contribution

Designed a six-agent automated workflow that iteratively refines prompts and aggregates LLM outputs to label patients for cognitive concerns.

Implemented the workflow with LLaMA 3 8B on 3,338 clinical notes from 200 patients and compared it to an expert-driven prompt-refinement benchmark.

Key Findings

Agentic prompt AP2 reached F1-score 0.91 on the prompt-refinement dataset.

Numbers: F1 = 0.91 (Table 3)

Practical Use: You can reach expert-level balanced performance quickly by using the agentic prompt refinement loop (use AP2 as a starting point).

Evidence Ref: Table 3

AP2 achieved perfect specificity (1.00) and PPV (1.00) on the prompt-refinement set after two iterations.

Numbers: Specificity = 1.00; PPV = 1.00 (Table 3)

Practical Use: Use agent-driven specificity improvement when false positives are costly; it eliminated false positives in the studied set.

Evidence Ref: Table 3
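
The link between "eliminated false positives" and the perfect specificity and PPV reported above follows directly from the metric definitions. The sketch below computes the screening metrics from confusion counts; the `Confusion` container, the function names, and the example counts are illustrative assumptions and are not the paper's confusion matrix.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Patient-level confusion counts (hypothetical container, not from the paper)."""
    tp: int
    fp: int
    tn: int
    fn: int


def safe_div(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a denominator is zero.
    return num / den if den else 0.0


def screening_metrics(c: Confusion) -> dict[str, float]:
    sensitivity = safe_div(c.tp, c.tp + c.fn)   # recall on true cases
    specificity = safe_div(c.tn, c.tn + c.fp)   # recall on non-cases
    ppv = safe_div(c.tp, c.tp + c.fp)           # precision
    npv = safe_div(c.tn, c.tn + c.fn)
    f1 = safe_div(2 * ppv * sensitivity, ppv + sensitivity)
    accuracy = safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only: with zero false positives, specificity and PPV
# are both 1.0, while sensitivity still depends on the false negatives.
example = Confusion(tp=40, fp=0, tn=50, fn=10)
metrics = screening_metrics(example)
```

With no false positives, F1 is limited only by sensitivity, which is why the held-out sensitivity drop matters even when specificity looks perfect.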

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Prompt-refinement F1-score (agentic AP2) | 0.91 | P0 F1 = 0.70 | +0.21 | prompt-refinement set (2,228 notes) | Table 3 shows AP2 F1 = 0.91 vs P0 F1 = 0.70 | Table 3 |
| Prompt-refinement specificity (agentic AP2) | 1.00 | P0 specificity = 0.20 | +0.80 | prompt-refinement set | AP2 specificity rose to 1.00 after specificity improver (Table 3) | Table 3 |

What To Try In 7 Days

Run LLaMA 3 8B locally on a small note sample and baseline with P0 ('Is this note indicative...').

Implement 2–3 agent roles: specialist (labeler), evaluator (metrics), and one improver (specificity or sensitivity).

Use the SOP checklist from the paper (keywords, meds, tests) to seed prompt AP2 and measure F1, sensitivity, specificity.
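
The three roles suggested above (specialist, evaluator, improver) can be wired into a small refinement loop with threshold-based stopping. This is a minimal sketch under loud assumptions: `toy_labeler` is a keyword matcher standing in for a LLaMA 3 8B call, `toy_improver` is a rule standing in for an LLM-driven specificity improver, and the notes, the keyword prompt format, and the 0.90 F1 target are all hypothetical rather than taken from the paper.

```python
from typing import Callable

Labeler = Callable[[str, str], bool]    # (prompt, note) -> predicted label
Improver = Callable[[str, dict], str]   # (prompt, report) -> revised prompt


def evaluate(prompt: str, notes: list[str], gold: list[bool], labeler: Labeler) -> dict:
    """Evaluator role: score a prompt and collect errors for the improver."""
    tp = 0
    false_positives, false_negatives = [], []
    for note, truth in zip(notes, gold):
        pred = labeler(prompt, note)
        if pred and truth:
            tp += 1
        elif pred:
            false_positives.append(note)
        elif truth:
            false_negatives.append(note)
    denom = 2 * tp + len(false_positives) + len(false_negatives)
    return {
        "f1": 2 * tp / denom if denom else 0.0,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


def refine(prompt, notes, gold, labeler, improver, target_f1=0.90, max_iters=5):
    """Iterate evaluate -> improve until the F1 threshold or the iteration cap."""
    history = []
    for _ in range(max_iters):
        report = evaluate(prompt, notes, gold, labeler)
        history.append((prompt, report["f1"]))
        if report["f1"] >= target_f1:
            break
        prompt = improver(prompt, report)
    return prompt, history


def keywords_in(prompt: str) -> list[str]:
    # Hypothetical prompt format: "keywords: a, b, c".
    return [k.strip() for k in prompt.split("keywords:", 1)[1].split(",") if k.strip()]


def toy_labeler(prompt: str, note: str) -> bool:
    # Specialist stand-in: flag the note if it contains any prompt keyword.
    return any(k in note.lower() for k in keywords_in(prompt))


def toy_improver(prompt: str, report: dict) -> str:
    # Improver stand-in: drop keywords that fired on false-positive notes.
    bad = {k for k in keywords_in(prompt)
           for note in report["false_positives"] if k in note.lower()}
    return "keywords: " + ", ".join(k for k in keywords_in(prompt) if k not in bad)


# Made-up notes and labels for illustration only.
notes = [
    "Patient reports memory loss and confusion.",
    "Family notes forgetfulness; MoCA ordered.",
    "Routine visit, no complaints.",
    "Patient tired after travel, otherwise well.",
]
gold = [True, True, False, False]
best_prompt, history = refine("keywords: memory, forgetful, tired",
                              notes, gold, toy_labeler, toy_improver)
```

Swapping `toy_labeler` and `toy_improver` for real model calls leaves `evaluate` and `refine` unchanged, which keeps the stopping rule and error reporting testable without a GPU.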

Agent Features

Memory
Short-term: aggregation of specialist outputs per patient
Planning
Iterative prompt refinement with threshold-based stopping
Decision rule: label patient 'positive' if any note is positive
Tool Use
ChatGPT-4o used for some prompt refinement steps
Hugging Face weights and local inference stack
Frameworks
Generated knowledge prompting
Standard Operating Procedure (SOP) for cognitive concern signals
Is Agentic

Yes

Architectures
Single LLM (LLaMA 3 8B) with multi-agent orchestration
Collaboration
Specialized agents exchange errors and prompt edits (improvers and summarizers)
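
The decision rule listed under Planning (label a patient 'positive' if any note is positive) amounts to a logical OR over that patient's note-level labels. A minimal sketch, assuming note-level results arrive as `(patient_id, is_positive)` pairs; the function name and the input shape are illustrative, not the paper's code.

```python
from collections import defaultdict
from typing import Iterable


def aggregate_patient_labels(note_labels: Iterable[tuple[str, bool]]) -> dict[str, bool]:
    """Label a patient positive if at least one of their notes is positive."""
    patients: dict[str, bool] = defaultdict(bool)  # default label is False
    for patient_id, is_positive in note_labels:
        patients[patient_id] = patients[patient_id] or is_positive
    return dict(patients)


# Hypothetical note-level outputs for two patients.
labels = [("p1", False), ("p1", True), ("p2", False), ("p2", False)]
patient_labels = aggregate_patient_labels(labels)
```

Because one positive note is enough, a single false-positive note flips the whole patient, so note-level specificity has an outsized effect on patient-level results.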

Optimization Features

Token Efficiency
Kept temperature low (0.1) and capped output tokens at 256
System Optimization
Deployed on server with 48 cores and 256 GB RAM
Inference Optimization
Used LLaMA 3 8B for local, lower-resource inference
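
The decoding settings above could be expressed as keyword arguments for a Hugging Face Transformers `generate` call. Only the values 0.1 and 256 come from the source; mapping the output cap to `max_new_tokens` and enabling `do_sample` so that the temperature takes effect are assumptions about how the authors configured inference.

```python
# Hypothetical generation settings for a local LLaMA 3 8B run.
GENERATION_KWARGS = {
    "do_sample": True,      # assumption: sampling on, so temperature is applied
    "temperature": 0.1,     # from the source: low temperature
    "max_new_tokens": 256,  # assumption: "capped output tokens at 256"
}

# Typical use with an already loaded model and tokenized inputs:
#   output_ids = model.generate(**inputs, **GENERATION_KWARGS)
```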

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small clinical cohort (200 patients) limits representativeness.

Uses only unstructured notes; missing structured data and multimodal signals.

When Not To Use

When you need guaranteed high sensitivity across diverse sites without local validation.

When structured EHR fields or multimodal data are required for diagnosis.

Failure Modes

Non-conforming outputs (not strict yes/no) leading to 'uncertain' exclusions.

Overfitting prompt edits to the refinement set, reducing sensitivity on new data.
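
The first failure mode can be monitored by mapping every raw reply to 'yes', 'no', or 'uncertain' and tracking the uncertain share. The sketch below is one possible strict parser, assuming a reply counts only if it starts with a standalone "yes" or "no"; the paper's actual parsing rule is not stated here, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_reply(reply: str) -> str:
    """Map a raw model reply to 'yes', 'no', or 'uncertain'."""
    # Accept only replies that begin with a standalone yes/no token.
    match = re.match(r"\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    return match.group(1).lower() if match else "uncertain"


def uncertain_rate(replies: list[str]) -> float:
    """Share of replies that could not be parsed as a strict yes/no."""
    if not replies:
        return 0.0
    counts = Counter(parse_reply(r) for r in replies)
    return counts["uncertain"] / len(replies)


# Made-up replies for illustration.
replies = [
    "Yes, the note mentions memory loss.",
    "No.",
    "Not sure, the note is about a knee injury.",
    "The patient was seen for follow-up.",
]
rate = uncertain_rate(replies)
```

Logging this rate per batch makes silent exclusions visible, which matters because excluded notes cannot contribute a positive label for their patient.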

Core Entities

Models

LLaMA 3 8B

Metrics

sensitivity, specificity, F1-score, PPV, NPV, accuracy

Datasets

Mass General Brigham clinical notes (3,338 notes; 200 patients; 2016-2018)

Benchmarks

Expert-driven prompt-refinement workflow (XP1..XP4)