Multi-agent LLM pipeline that auto-generates themes from clinical transcripts and optionally adapts with RLHF

June 30, 2025 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding

Why It Matters For Business

Auto-TA can turn large sets of interview transcripts into actionable themes quickly. That lets health services, product teams, and research groups scale qualitative analysis without hiring proportional human coders. However, you must validate output quality and watch for domain drift.

Summary TLDR

Auto-TA is a multi-agent LLM pipeline that converts clinical interview transcripts into codified themes without manual coding. Role-conditioned agents (coder, theme-generator, feedback) run end-to-end and can optionally use RLHF (PPO + binary human rewards) to tune theme quality. On a 9-transcript AAOCA subset, identity-conditioned agents raised credibility scores by about 11–16 points versus a no-identity baseline, but standard surface metrics remain low and the system is sensitive to prompts and domain.

Problem Statement

Manual thematic analysis of clinical narratives is slow, costly, and hard to scale. Existing LLM approaches often still require full human transcript review. The paper asks: can we fully automate end-to-end thematic analysis to scale qualitative insights for clinical use?

Main Contribution

Auto-TA: an end-to-end multi-agent LLM pipeline that generates codes and themes from unstructured clinical transcripts without manual coding.

Identity-conditioned multi-agent design: coder agents with domain personas (e.g., cardiac surgeon, researcher, layperson) plus theme-generation and feedback agents to refine outputs.

Optional RLHF integration: a PPO-based loop using binary human rewards and a reward model to adapt theme generation toward human preferences (implementation in progress).
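The coder, theme-generation, and feedback stages listed above can be sketched as a small staged pipeline. This is a minimal illustration, not the authors' implementation: the persona wording, the prompt strings, the `run_pipeline` name, and the `llm` callable (any function that takes a system prompt and a user prompt and returns text) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical signature for a model call: (system_prompt, user_prompt) -> text.
LLM = Callable[[str, str], str]

# Illustrative persona prompts; the paper's actual wording is not reproduced here.
PERSONAS: Dict[str, str] = {
    "cardiac surgeon": "You are a cardiac surgeon reading family interviews.",
    "researcher": "You are a qualitative health researcher.",
    "layperson": "You are a parent with no medical training.",
}


def run_pipeline(chunks: List[str], llm: LLM,
                 personas: Dict[str, str] = PERSONAS) -> str:
    """Run coder agents, then a theme generator, then a feedback agent."""
    codes: List[str] = []
    # Stage 1: every persona codes every transcript chunk.
    for name, system_prompt in personas.items():
        for chunk in chunks:
            reply = llm(system_prompt,
                        "Assign short codes to this excerpt:\n" + chunk)
            codes.append(f"[{name}] {reply}")
    # Stage 2: a theme generator groups the pooled codes.
    draft = llm("You group codes into themes.", "\n".join(codes))
    # Stage 3: a feedback agent critiques and revises the draft themes.
    return llm("You critique the themes and return a revised list.", draft)


def offline_stub(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without network access."""
    return f"({len(user_prompt)} chars seen)"


if __name__ == "__main__":
    print(run_pipeline(["first excerpt", "second excerpt"], offline_stub))
```

Passing the model as a plain callable keeps the stages testable offline; a real deployment would replace `offline_stub` with a client for the chosen model.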

Key Findings

Assigning domain identities to coder agents substantially improved credibility scores.

Numbers: Credibility baseline 82.13 → Cardiac Surgeon 98.41 (+16.28)

Auto-TA runs quickly on medium-length transcripts.

Numbers: Under 10 minutes per ~10k-word transcript (authors' claim)

Standard surface metrics show low alignment even when themes are meaningfully related.

Numbers: Bidirectional cosine baseline 0.132 ± 0.027; many identity-conditioned agents score lower (e.g., Cardiac Surgeon 0.115)

Transferability and dependability are modest but usable across this dataset.

Numbers: Dependability D baseline 0.400; Transferability T baseline 0.308; identity gains up to +0.027 in T

RLHF is supported by design but not fully deployed in this study.

Numbers: RLHF implementation described as 'in progress'; reward scheme uses binary human labels with PPO
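To make the binary-reward idea concrete, here is a minimal sketch of a reward model fitted to accept/reject labels. It is an assumption-laden illustration: the source only states that binary human labels and PPO are used, so the logistic-regression form, the two hypothetical features per theme, and the function names are all invented for the example, and the PPO policy update itself is left out.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear score; the last entry of w is the bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]


def train_reward_model(features: List[Sequence[float]], labels: List[int],
                       lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Fit weights by batch gradient descent on binary cross-entropy."""
    dim = len(features[0])
    w = [0.0] * (dim + 1)
    n = len(features)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(features, labels):
            err = sigmoid(score(w, x)) - y  # gradient of BCE w.r.t. the score
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def reward(w: Sequence[float], x: Sequence[float]) -> float:
    """Predicted probability that a human would accept the theme."""
    return sigmoid(score(w, x))


if __name__ == "__main__":
    # Hypothetical features: (share of quotes grounded, topical coverage).
    X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
    y = [1, 1, 0, 0]  # 1 = reviewer accepted the theme, 0 = rejected
    w = train_reward_model(X, y)
    print(round(reward(w, [0.9, 0.9]), 3), round(reward(w, [0.1, 0.1]), 3))
```

In an RLHF loop, a score like `reward(w, x)` would serve as the scalar signal that a PPO optimizer maximizes for the theme generator.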

Results

Credibility (C)

Value: Baseline 82.13 ± 18.96; Cardiac Surgeon 98.41 ± 4.76

Baseline: 82.13 ± 18.96

Dependability (D)

Value: Baseline 0.400 ± 0.017; Cardiac Surgeon 0.395 ± 0.019

Baseline: 0.400 ± 0.017

Transferability (T)

Value: Baseline 0.308 ± 0.018; Medical Doctor 0.334 ± 0.007 (+0.026)

Baseline: 0.308 ± 0.018

Bidirectional Cosine Similarity (C_bi)

Value: Baseline 0.132 ± 0.027; Cardiac Surgeon 0.115 ± 0.053

Baseline: 0.132 ± 0.027

Runtime claim

Value: Under 10 minutes per 10k-word transcript

What To Try In 7 Days

Run a 1–3 transcript pilot: spawn 4 role-conditioned agents (e.g., clinician, researcher, layperson, vanilla) and compare themes to a human-coded reference.

Measure throughput: time per 10k words and cost of API calls to estimate operational cost.

Add a feedback agent that enforces quote traceability (Quote IDs) to catch hallucinations early.
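The quote-traceability check in the last step can be prototyped without any model call. The sketch below assumes a hypothetical data layout: each theme is a dict with a `quote_ids` list, and `quote_index` maps each Quote ID to the quoted text. It flags IDs that are unknown or whose text does not appear verbatim (ignoring case and whitespace) in the transcript.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not hide a match."""
    return " ".join(text.lower().split())


def untraceable_quotes(theme: Dict, quote_index: Dict[str, str],
                       transcript: str) -> List[str]:
    """Return the Quote IDs cited by a theme that cannot be traced to the transcript."""
    haystack = normalize(transcript)
    problems: List[str] = []
    for qid in theme["quote_ids"]:
        quote = quote_index.get(qid)
        # Unknown ID, or quoted text that does not occur in the transcript.
        if quote is None or normalize(quote) not in haystack:
            problems.append(qid)
    return problems


if __name__ == "__main__":
    transcript = "We were scared when the doctor said surgery.\nIt took months."
    quote_index = {"Q1": "we were scared", "Q2": "we felt fine"}
    theme = {"name": "Fear of surgery", "quote_ids": ["Q1", "Q2", "Q3"]}
    print(untraceable_quotes(theme, quote_index, transcript))
```

A theme with a non-empty result is a candidate hallucination and can be sent back to the theme generator or to a human reviewer.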

Agent Features

Memory

  • chunked transcript broadcasting (no long-term retriever memory)
  • audit trail traceability via Quote IDs

Planning

  • iterative refinement loop with heuristic edits
  • optional PPO-based policy updates for theme generator

Tool Use

  • PPO (for RLHF)
  • sentence embeddings for semantic alignment
  • heuristic edit rules in feedback loop

Frameworks

  • Auto-TA (this work)
  • AutoGen and CAMEL referenced as enabling frameworks

Is Agentic

true

Architectures

  • multi-agent LLM pipeline
  • role-conditioned agents (coder, theme-generator, feedback)

Collaboration

  • specialized agent roles collaborate via staged pipeline
  • feedback agent acts as critic; agents do not directly interact yet

Optimization Features

Token Efficiency

  • chunking transcripts into <=1500-unit batches to fit model input limits
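A minimal sketch of the chunking step follows. The source does not say what a "unit" is, so this example assumes whitespace-delimited words; a token-based splitter would follow the same pattern. The function name and default are illustrative.

```python
from typing import List


def chunk_transcript(text: str, max_units: int = 1500) -> List[str]:
    """Split a transcript into consecutive chunks of at most max_units words."""
    if max_units <= 0:
        raise ValueError("max_units must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_units])
        for start in range(0, len(words), max_units)
    ]


if __name__ == "__main__":
    demo = " ".join(["word"] * 3200)
    print([len(chunk.split()) for chunk in chunk_transcript(demo)])
```

Splitting on a fixed word count can cut a speaker turn in half; splitting on turn boundaries first and then packing turns into chunks is a common refinement.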

System Optimization

  • parallel role-conditioned agents to reduce wall-clock time
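Running the role-conditioned agents concurrently can be sketched with the standard library's thread pool, which suits I/O-bound model calls. The `code_chunk` callable (persona and chunk in, codes out) is a hypothetical stand-in for whatever function wraps the model request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_agents_in_parallel(personas: List[str], chunk: str,
                           code_chunk: Callable[[str, str], str]) -> Dict[str, str]:
    """Call code_chunk(persona, chunk) for every persona concurrently."""
    # One worker per persona; max(1, ...) avoids an invalid pool size of zero.
    with ThreadPoolExecutor(max_workers=max(1, len(personas))) as pool:
        futures = {p: pool.submit(code_chunk, p, chunk) for p in personas}
        # result() blocks until that agent finishes and re-raises its errors.
        return {p: f.result() for p, f in futures.items()}


if __name__ == "__main__":
    fake_agent = lambda persona, chunk: f"{persona}: {len(chunk.split())} words coded"
    print(run_agents_in_parallel(["clinician", "layperson"], "a short excerpt", fake_agent))
```

With this layout, wall-clock time per chunk is roughly that of the slowest agent rather than the sum of all agents, subject to the provider's rate limits.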

Training Optimization

  • SFT

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation limited to a small AAOCA subset (9 transcripts, 42 parents); generalizability to other domains is untested.
  • The RLHF pipeline is described, but its implementation is 'in progress'; claims about adaptive improvements are theoretical here.
  • Agents do not interact or negotiate directly; no multi-agent dialogues beyond staged pipeline.
  • High sensitivity to prompt wording; small prompt edits can change outputs and hurt reproducibility.
  • Alignment assessment relies on a single human-coded reference; multiple valid thematic interpretations exist.

When Not To Use

  • For single-case clinical decision-making that requires human adjudication.
  • When strict, auditable provenance and regulatory compliance require open-source toolchains (no code release noted).
  • On domains very different from cardiac family narratives without revalidation.

Failure Modes

  • Hallucinated themes not grounded in quotes despite high semantic plausibility.
  • Identity prompts biasing outputs toward particular perspectives and missing alternative themes.
  • Low surface-metric scores masking reasonable but differently-worded themes.
  • Instability across runs if prompts or agent identities change.

Core Entities

Models

  • GPT-4o (role-conditioned agents)
  • all-mpnet-base-v2 (sentence embeddings for cosine similarity)
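Once themes are embedded (for example with all-mpnet-base-v2), a bidirectional cosine score can be computed from the vectors alone. The exact formula is not given in this summary, so the sketch assumes one common definition: for each direction, take every item's best match in the other set, average those, then average the two directions. It works on precomputed vectors and needs no embedding library.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def directed_similarity(source: List[Vector], target: List[Vector]) -> float:
    """Mean, over source vectors, of the best cosine match found in target."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def bidirectional_cosine(generated: List[Vector], reference: List[Vector]) -> float:
    """Average of generated->reference and reference->generated similarity."""
    if not generated or not reference:
        raise ValueError("both theme sets must be non-empty")
    return 0.5 * (directed_similarity(generated, reference)
                  + directed_similarity(reference, generated))


if __name__ == "__main__":
    generated = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-d "embeddings"
    reference = [[1.0, 0.0]]
    print(bidirectional_cosine(generated, reference))
```

Scoring both directions penalizes generated themes with no counterpart in the reference as well as reference themes that the system missed.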

Metrics

  • Credibility (C) - quote-grounding overlap
  • Dependability (D) - ROUGE-based inter-run overlap
  • Transferability (T) - ROUGE-based train/val overlap
  • Bidirectional Cosine Similarity (C_bi)
  • Levenshtein similarity (D_L)
  • BLEU (B)
  • ROUGE-1/ROUGE-2
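For the ROUGE-based metrics above, a unigram-overlap F1 is enough to show the idea. This is a simplified sketch: it tokenizes on whitespace, applies no stemming, and averages over all run pairs, which is an assumption about how an inter-run score such as Dependability might be aggregated rather than the paper's stated procedure.

```python
from collections import Counter
from itertools import combinations
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def inter_run_overlap(runs: List[str]) -> float:
    """Mean pairwise ROUGE-1 F1 across the theme text of repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(rouge1_f1(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(rouge1_f1("fear of surgery", "fear of sudden death"))
    print(inter_run_overlap(["fear of surgery", "fear of surgery", "trust in doctors"]))
```

Because the score counts exact word matches, two themes that say the same thing in different words can score near zero, which is consistent with the low surface-metric values reported above.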

Datasets

  • AAOCA subset (9 transcripts, 42 parents, avg 10,987 words)
  • Parent SV-CHD corpus (full: 58 sessions, ~520k words) referenced

Context Entities

Models

  • GPT-3.5/GPT-style agents (referenced related work)

Datasets

  • Mery et al. (2023) human-generated themes (used as reference)