Interpret chest X‑ray reports by combining concept bottlenecks with a multi‑agent retrieval system

December 20, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag

Links

Abstract / PDF

Why It Matters For Business

Shows a practical path to explainable automated CXR outputs: surface concept-level evidence alongside each prediction while keeping competitive accuracy (81% on COVID-QU), and improve report quality with a multi-agent retrieval pipeline.

Summary TLDR

The paper builds an interpretable chest X‑ray pipeline. A Concept Bottleneck Model (CBM) maps image features to 20 human‑readable clinical concepts; a multi‑agent Retrieval-Augmented Generation (RAG) system (disease-specific ReAct agents, a Radiologist agent, and a Medical Writer agent) then retrieves clinical documents and composes reports. On the COVID-QU dataset (33,920 images) the CBM reached 81% classification accuracy. Multi‑agent report generation improved LLM-judge metrics (for Mistral 7B: Correctness 0.85 → 0.95, Clinical Usefulness 0.92 → 0.96). The method supports interventions (fixing 3–4 concepts often corrects errors), and it accepts lower clustering scores (Silhouette 0.27 versus 0.41 for the single-agent baseline) in exchange for more clinically realistic reports.

Problem Statement

Deep CXR classifiers are accurate but hard to trust because they are black boxes. Automated report generators can be factually inconsistent. The paper aims to make CXR classification and report generation explainable by exposing concept-level evidence and using a multi-agent retrieval and writing pipeline.

Main Contribution

Combine a CBM built on automatic concept discovery with ChexAgent image embeddings, so that each disease prediction comes with a concept vector that explains it.

Introduce a multi-agent RAG system: disease-specific ReAct agents, a Radiologist agent to score concept influence, and a Medical Writer agent to compose reports.

Evaluate on COVID-QU (33,920 images): 81% classification accuracy and LLM-judge gains in report correctness and clinical usefulness versus single-agent baselines.

Show concept interventions: correcting 3–4 top contributing concepts often fixes misclassifications, demonstrating actionable interpretability.

Open-source intent: authors state code will be released (GitHub link provided in paper).
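The concept-bottleneck idea in the first contribution can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the concept names, both weight matrices, and the 4-dimensional embedding are made-up placeholders (the paper uses 20 concepts and ChexAgent embeddings), and plain linear layers stand in for the trained model. The point is that the class is predicted from concept scores alone, so each concept's contribution to the decision can be read off directly.

```python
import math

# Hypothetical concept names and weights, for illustration only.
CONCEPTS = ["ground-glass opacity", "consolidation", "clear lung fields"]
CLASSES = ["COVID-19", "Normal"]

# One row per concept, applied to a 4-dim placeholder image embedding.
W_CONCEPT = [
    [1.0, 0.5, 0.0, -0.5],
    [0.5, 1.0, -0.5, 0.0],
    [-1.0, -0.5, 1.0, 0.5],
]
# One row per class, one weight per concept.
W_CLASS = [
    [2.0, 1.5, -2.0],   # COVID-19
    [-2.0, -1.5, 2.0],  # Normal
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concept_scores(embedding):
    """Map an image embedding to concept activations in (0, 1)."""
    return [sigmoid(sum(w * e for w, e in zip(row, embedding))) for row in W_CONCEPT]


def classify(concepts):
    """Predict a class from concept scores only; also return per-concept contributions."""
    logits = [sum(w * c for w, c in zip(row, concepts)) for row in W_CLASS]
    best = max(range(len(CLASSES)), key=lambda i: logits[i])
    contributions = {name: W_CLASS[best][j] * concepts[j] for j, name in enumerate(CONCEPTS)}
    return CLASSES[best], contributions


embedding = [1.0, 1.0, -1.0, -1.0]  # placeholder for a ChexAgent image embedding
scores = concept_scores(embedding)
label, contributions = classify(scores)

print(label)
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.2f}")
```

Sorting by absolute contribution gives the ranking of "top contributing concepts" that the intervention experiments refer to.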

Key Findings

CBM classification accuracy on COVID-QU

Numbers: 81% accuracy on COVID-QU (Table 1)

Multi-agent RAG improves report correctness and clinical usefulness

Numbers: Correctness 0.85 → 0.95; Clinical Usefulness 0.92 → 0.96 (Mistral 7B, Table 3)

Concept intervention corrects many misclassifications

Numbers: Fixing 3–4 top concepts yields a significant performance increase (Fig. 3b)

Results

Accuracy

Value: 0.81

Baseline: Bio-VIL 0.78

Report correctness (LLM judge)

Value: 0.95

Baseline: Single-agent 0.85

Clinical usefulness (LLM judge)

Value: 0.96

Baseline: Single-agent 0.92

Clustering quality (Silhouette)

Value: 0.27

Baseline: Single-agent 0.41
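For readers unfamiliar with the Silhouette metric in the last row, here is a small pure-Python sketch of the standard definition: for each point, a is the mean distance to the other members of its cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b), averaged over all points. The summary does not say what exactly is clustered in the paper, so the 2-D points below are toy data chosen only to show that well-separated clusters score near 1 and interleaved clusters score near or below 0.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient for labeled points, using Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for name, other in clusters.items()
            if name != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two tight, distant clusters versus two interleaved clusters.
well_separated = silhouette([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
mixed = silhouette([(0, 0), (0, 1), (0, 0.5), (0, 1.5)], [0, 0, 1, 1])
print(round(well_separated, 2), round(mixed, 2))
```

Read this way, a drop from 0.41 to 0.27 means the grouped items are less cleanly separated, which is the cost the summary sets against the gains in judged correctness and usefulness.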

Who Should Care

What To Try In 7 Days

Run a CBM on a sample CXR set to surface concept scores for each prediction.

Prototype a simple ReAct retrieval agent querying a Qdrant index of clinical docs.

Add a lightweight 'concept intervention' step: allow a clinician to correct top 3 concept scores and re-evaluate predictions on misclassified cases.
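The intervention step in the last item can be prototyped before any real model is involved. The sketch below is a minimal illustration, assuming a linear class head over concept scores; the concept names, weights, predicted scores, and clinician corrections are all hypothetical. It finds the top-k concepts driving the current prediction, overwrites them with clinician-supplied values, and re-runs the prediction.

```python
# Hypothetical linear head over concept scores; weights are illustrative only.
CONCEPTS = ["ground-glass opacity", "consolidation", "pleural effusion", "clear lung fields"]
CLASSES = ["COVID-19", "Normal"]
W_CLASS = [
    [2.0, 1.5, 0.5, -2.0],    # COVID-19
    [-2.0, -1.5, -0.5, 2.0],  # Normal
]


def predict(concepts):
    """Return the index of the highest-scoring class and the raw logits."""
    logits = [sum(w * c for w, c in zip(row, concepts)) for row in W_CLASS]
    return max(range(len(CLASSES)), key=lambda i: logits[i]), logits


def top_contributing(concepts, class_idx, k=3):
    """Indices of the k concepts with the largest absolute contribution to class_idx."""
    contrib = [abs(W_CLASS[class_idx][j] * c) for j, c in enumerate(concepts)]
    return sorted(range(len(concepts)), key=lambda j: -contrib[j])[:k]


def intervene(concepts, corrections, k=3):
    """Overwrite the top-k contributing concepts with clinician values, then re-predict."""
    pred_idx, _ = predict(concepts)
    fixed = list(concepts)
    for j in top_contributing(concepts, pred_idx, k):
        if CONCEPTS[j] in corrections:
            fixed[j] = corrections[CONCEPTS[j]]
    new_idx, _ = predict(fixed)
    return fixed, CLASSES[pred_idx], CLASSES[new_idx]


# A misclassified case: the model over-scores opacity and consolidation.
predicted = [0.9, 0.8, 0.1, 0.2]
clinician = {"ground-glass opacity": 0.0, "consolidation": 0.0, "clear lung fields": 1.0}
fixed, before, after = intervene(predicted, clinician, k=3)
print(before, "->", after)
```

Running this over a set of misclassified cases and counting how many flip to the correct label gives a rough local version of the intervention curve the paper reports.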

Agent Features

Memory

  • retrieval memory via document embeddings

Planning

  • sequential agent pipeline (retrieve → analyze → write)

Tool Use

  • vector DB retrieval (Qdrant)
  • LLMs for embedding and judging
  • VLM for image embeddings (ChexAgent)
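The retrieval tool can be mocked without any external service while the rest of a prototype is built. The sketch below is a stand-in, not the Qdrant API: a toy bag-of-words `embed` function replaces a real embedding model, `InMemoryIndex` is a hypothetical class that replaces the vector database, and the three documents are placeholder snippets written for this example. It shows only the core operation, ranking stored documents by cosine similarity to a query.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Hypothetical minimal stand-in for a vector database collection."""

    def __init__(self):
        self.items = []

    def add(self, doc_id, text):
        self.items.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        """Return the k most similar documents as (doc_id, score) pairs."""
        q = embed(query)
        scored = [(cosine(q, vec), doc_id) for doc_id, _, vec in self.items]
        scored.sort(key=lambda pair: -pair[0])
        return [(doc_id, round(score, 3)) for score, doc_id in scored[:k]]


# Placeholder snippets, not real clinical documents.
index = InMemoryIndex()
index.add("doc-covid", "bilateral ground-glass opacity is common in covid-19 pneumonia")
index.add("doc-normal", "normal chest radiograph shows clear lung fields")
index.add("doc-bacterial", "lobar consolidation suggests bacterial pneumonia")

hits = index.search("ground-glass opacity in both lungs", k=2)
print(hits)
```

Swapping this class for a real vector store later should only change `add` and `search`, provided the agent code depends on nothing else.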

Frameworks

  • CrewAI
  • LlamaIndex

Is Agentic

true

Architectures

  • ReAct agent per disease
  • Radiologist agent
  • Medical Writer agent

Collaboration

  • agent-to-agent handoff (ReAct → Radiologist → Writer)
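The handoff order can be illustrated with plain functions passing one shared record down the line. This is a structural sketch only: it does not use CrewAI or LlamaIndex, no LLM is called, and the `Case` fields, function names, corpus, and report template are all assumptions made for the example. Each function stands in for one agent and fills in the part of the record the next one needs.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical record handed from one agent to the next."""
    predicted_disease: str
    concept_scores: dict
    evidence: list = field(default_factory=list)
    ranked_concepts: list = field(default_factory=list)
    report: str = ""


def retrieval_agent(case, corpus):
    """Stand-in for a disease-specific ReAct agent: fetch documents for the predicted disease."""
    case.evidence = corpus.get(case.predicted_disease, [])
    return case


def radiologist_agent(case):
    """Stand-in for the Radiologist agent: rank concepts, highest score first."""
    case.ranked_concepts = sorted(case.concept_scores, key=case.concept_scores.get, reverse=True)
    return case


def writer_agent(case):
    """Stand-in for the Medical Writer agent: fill a fixed report template."""
    findings = ", ".join(case.ranked_concepts[:2])
    case.report = (
        f"Impression: {case.predicted_disease}. Key findings: {findings}. "
        f"Supporting documents: {len(case.evidence)}."
    )
    return case


def run_pipeline(case, corpus):
    """Sequential handoff: retrieve, then analyze, then write."""
    return writer_agent(radiologist_agent(retrieval_agent(case, corpus)))


corpus = {"COVID-19": ["placeholder guideline excerpt"]}
case = Case("COVID-19", {"ground-glass opacity": 0.9, "consolidation": 0.7, "pleural effusion": 0.1})
result = run_pipeline(case, corpus)
print(result.report)
```

Keeping the shared record explicit makes it easy to log what each stage added, which helps when tracing a wrong report back to a bad retrieval or a bad concept ranking.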

Reproducibility

Data Urls

  • COVID-QU dataset (Chowdhury et al., 2020, IEEE Access) referenced in paper

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to COVID-QU dataset; generalization to other hospitals is untested.
  • Report judgments rely on LLMs, which can reflect model biases and do not replace clinician evaluation.
  • Concept discovery depends on GPT-4 prompts and choices of descriptors, which may vary.
  • No human clinician study reported for real-world workflow integration.

When Not To Use

  • Do not deploy for direct clinical decision-making without clinician validation.
  • Avoid using this exact pipeline on non-CXR imaging without revalidation.
  • Not suited for real-time triage under low-latency constraints, because multi-agent retrieval adds latency.

Failure Modes

  • Wrong concept extraction leads to incorrect diagnosis and misleading reports.
  • Retrieval returns noisy or irrelevant documents that drive incorrect explanations.
  • LLM-based judge overestimates report quality or misses subtle clinical errors.

Core Entities

Models

  • ChexAgent
  • Concept Bottleneck Model (CBM)
  • CLIP
  • Bio-VIL
  • Label-free CBM
  • Robust CBM
  • GPT-4
  • Mistral Embed Model
  • Mistral 7B
  • Llama 3.1
  • Gemma2
  • LLaVA
  • GPT-3.5 Turbo
  • Dragonfly-Med
  • Medllama2

Metrics

  • Accuracy
  • semantic similarity
  • correctness
  • clinical usefulness
  • consistency
  • Silhouette
  • Davies-Bouldin
  • Calinski-Harabasz
  • Dunn

Datasets

  • COVID-QU (33,920 CXR images)