Multi-agent LLaMA 3 workflow matches expert prompts for detecting cognitive concerns in clinical notes

Overview

Decision Snapshot: Needs Validation

The workflow shows practical gains in iteration reduction and specificity on the studied set, but sensitivity and generalizability decline on held-out data; test locally and monitor uncertain and false-negative rates before production.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri


Why It Matters For Business

Automated agent pipelines can cut human prompt-tuning time and reach near-expert accuracy on clinical-note screening, lowering labor cost and speeding deployment in health systems.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO, Product Manager

Summary TLDR

The authors built a fully automated multi-agent workflow that uses LLaMA 3 8B to screen clinical notes for cognitive concerns. On 3,338 notes from 200 patients, the agentic workflow reached an F1 of 0.91 on the prompt-refinement set with perfect specificity (1.00) after two iterations, matching an expert-driven benchmark (F1 0.90). On an independent validation set, sensitivity dropped for both workflows (AP2 F1 0.76 vs XP4 F1 0.79). The approach cuts human prompt-tuning steps, but it can output non-binary or irrelevant replies and is limited to unstructured notes.

Problem Statement

Early cognitive decline is subtle and under-documented in clinical notes. Manual prompt tuning for LLM screening is time-consuming and resource intensive. The paper aims to automate prompt refinement with specialized agents using LLaMA 3 8B to reach expert-level screening accuracy faster and at lower human cost.

Main Contribution

Designed a six-agent automated workflow that iteratively refines prompts and aggregates LLM outputs to label patients for cognitive concerns.

Implemented the workflow with LLaMA 3 8B on 3,338 clinical notes from 200 patients and compared it to an expert-driven prompt-refinement benchmark.

Key Findings

Agentic prompt AP2 reached F1-score 0.91 on the prompt-refinement dataset.

Numbers: F1 = 0.91 (Table 3)

Practical Use: You can reach expert-level balanced performance quickly by using the agentic prompt refinement loop (use AP2 as a starting point).

Evidence Ref: Table 3

AP2 achieved perfect specificity (1.00) and PPV (1.00) on the prompt-refinement set after two iterations.

Numbers: Specificity = 1.00; PPV = 1.00 (Table 3)

Practical Use: Use agent-driven specificity improvement when false positives are costly; it eliminated false positives in the studied set.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Prompt-refinement F1-score (agentic AP2)	0.91	P0 F1 = 0.70	+0.21	prompt-refinement set (2,228 notes)	Table 3 shows AP2 F1 = 0.91 vs P0 F1 = 0.70	Table 3
Prompt-refinement specificity (agentic AP2)	1.00	P0 specificity = 0.20	+0.80	prompt-refinement set	AP2 specificity rose to 1.00 after specificity improver (Table 3)	Table 3
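The metrics reported above all derive from the same four confusion counts, which is why specificity and PPV reach 1.00 together once false positives drop to zero. A minimal sketch of that arithmetic follows; the `Confusion` class, the `screening_metrics` helper and the example counts are hypothetical illustrations, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Patient-level confusion counts (hypothetical container)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(num: float, den: float) -> float:
    # Return 0.0 instead of raising when the denominator is zero.
    return num / den if den else 0.0


def screening_metrics(c: Confusion) -> dict:
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    ppv = _ratio(c.tp, c.tp + c.fp)
    npv = _ratio(c.tn, c.tn + c.fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts for illustration only; they are NOT the paper's data.
example = Confusion(tp=40, fp=0, tn=50, fn=10)
metrics = screening_metrics(example)
```

With zero false positives, both specificity and PPV come out as 1.0 regardless of how many positives are missed, so a perfect specificity figure should always be read next to sensitivity.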

What To Try In 7 Days

Run LLaMA 3 8B locally on a small note sample and record baseline metrics with prompt P0 ('Is this note indicative...').

Implement 2–3 agent roles: specialist (labeler), evaluator (metrics), and one improver (specificity or sensitivity).

Use the SOP checklist from the paper (keywords, meds, tests) to seed prompt AP2 and measure F1, sensitivity, specificity.
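The three roles above can be wired into a small loop before any real model is involved. The sketch below is one possible shape, assuming a hypothetical `llm(prompt, text)` callable that returns a string; `toy_llm`, the starter prompt, the appended prompt sentences and the 0.9 target are all invented for illustration and are not the paper's agents, prompts or thresholds.

```python
from typing import Callable, Dict, List, Tuple

Note = Tuple[str, bool]  # (note text, gold label); hypothetical toy format


def specialist(llm: Callable[[str, str], str], prompt: str, text: str) -> bool:
    # Labeler role: treat a reply that starts with "yes" as a positive note.
    return llm(prompt, text).strip().lower().startswith("yes")


def evaluator(preds: List[bool], golds: List[bool]) -> Dict[str, float]:
    # Metrics role: sensitivity and specificity from the confusion counts.
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    tn = sum(not p and not g for p, g in zip(preds, golds))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def improver(prompt: str, metrics: Dict[str, float], target: float) -> str:
    # Hypothetical stand-in: a real improver agent would ask an LLM to
    # rewrite the prompt based on a summary of the errors.
    if metrics["specificity"] < target:
        return prompt + " Answer yes only if a concern is explicitly documented."
    return prompt + " Also answer yes for indirect signs such as memory complaints."


def refine(llm, prompt: str, notes: List[Note], target: float = 0.9, max_iters: int = 3):
    history = []
    for _ in range(max_iters):
        preds = [specialist(llm, prompt, text) for text, _ in notes]
        metrics = evaluator(preds, [gold for _, gold in notes])
        history.append((prompt, metrics))
        # Threshold-based stopping: quit once both metrics meet the target.
        if metrics["sensitivity"] >= target and metrics["specificity"] >= target:
            break
        prompt = improver(prompt, metrics, target)
    return prompt, history


def toy_llm(prompt: str, text: str) -> str:
    # Toy model: says yes to everything until the prompt asks for explicit evidence.
    if "explicitly" in prompt:
        return "yes" if "memory loss" in text else "no"
    return "yes"


notes = [("Patient reports memory loss.", True), ("Routine knee follow-up.", False)]
final_prompt, history = refine(toy_llm, "Does this note suggest a cognitive concern?", notes)
```

Swapping `toy_llm` for a call into a local LLaMA 3 8B endpoint keeps the loop unchanged, which makes it easy to compare the baseline prompt against each refined version.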

Agent Features

Memory

Short-term: aggregation of specialist outputs per patient

Planning

Iterative prompt refinement with threshold-based stopping

Decision rule: label patient 'positive' if any note is positive
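The any-positive decision rule amounts to a group-by over note-level labels. A minimal sketch, assuming each note has already been labeled "yes" or "no" and is keyed by a patient identifier; the `label_patients` function and the sample IDs are hypothetical, and treating any other label as non-positive is an assumption rather than the paper's stated handling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def label_patients(note_labels: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse (patient_id, note_label) pairs into one label per patient."""
    by_patient: Dict[str, List[str]] = defaultdict(list)
    for patient_id, label in note_labels:
        by_patient[patient_id].append(label)
    # Any single "yes" note makes the patient positive (assumption: labels
    # other than "yes" never count toward a positive).
    return {
        patient_id: "positive" if "yes" in labels else "negative"
        for patient_id, labels in by_patient.items()
    }


# Made-up note labels for two hypothetical patients.
patient_labels = label_patients([("p1", "no"), ("p1", "yes"), ("p2", "no")])
```

Because one positive note is enough, note-level false positives propagate directly to the patient level, which is one reason specificity tuning matters with this rule.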

Tool Use

ChatGPT-4o used for some prompt refinement steps

Hugging Face weights and local inference stack

Frameworks

Generated knowledge prompting

Standard Operating Procedure (SOP) for cognitive concern signals

Is Agentic

Yes

Architectures

Single LLM (LLaMA 3 8B) with multi-agent orchestration

Collaboration

Specialized agents exchange errors and prompt edits (improvers and summarizers)

Optimization Features

Token Efficiency

Kept temperature low (0.1) and capped output tokens at 256

System Optimization

Deployed on server with 48 cores and 256 GB RAM

Inference Optimization

Used LLaMA 3 8B for local, lower-resource inference

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Small clinical cohort (200 patients) limits representativeness.

Uses only unstructured notes; missing structured data and multimodal signals.

When Not To Use

When you need guaranteed high sensitivity across diverse sites without local validation.

When structured EHR fields or multimodal data are required for diagnosis.

Failure Modes

Non-conforming outputs (not strict yes/no) leading to 'uncertain' exclusions.

Overfitting prompt edits to the refinement set, reducing sensitivity on new data.
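The first failure mode can be tracked with a strict reply parser and a running uncertain rate. The sketch below assumes a deliberately strict rule, where a reply counts only if it begins with "yes" or "no"; that rule and the `parse_reply` and `uncertain_rate` helpers are hypothetical, not the paper's parsing logic.

```python
import re
from typing import List


def parse_reply(reply: str) -> str:
    """Return 'yes', 'no', or 'uncertain' for a raw model reply."""
    # Accept only replies that start with a whole-word yes/no (case-insensitive).
    match = re.match(r"\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    return match.group(1).lower() if match else "uncertain"


def uncertain_rate(replies: List[str]) -> float:
    """Fraction of replies that fail the strict yes/no check."""
    labels = [parse_reply(r) for r in replies]
    return labels.count("uncertain") / len(labels) if labels else 0.0


# Made-up replies for illustration.
sample_replies = [
    "Yes, the note mentions memory loss.",
    "  no",
    "The note discusses a knee injury.",
    "Nothing relevant was found.",
]
rate = uncertain_rate(sample_replies)
```

Logging this rate per prompt version shows whether a refined prompt improves accuracy only by pushing more notes into the excluded "uncertain" bucket.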

Core Entities

Models

LLaMA 3 8B

Metrics

Sensitivity, specificity, F1-score, PPV, NPV, accuracy

Datasets

Mass General Brigham clinical notes (3,338 notes; 200 patients; 2016-2018)

Benchmarks

Expert-driven prompt-refinement workflow (XP1..XP4)

