Interpret chest X‑ray reports by combining concept bottlenecks with a multi‑agent retrieval system

December 20, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Prototype-stage system: promising accuracy and report gains on one public dataset, validated by LLM judges but lacking clinical user studies and broad external validation.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Shows a practical path to make automated CXR outputs explainable: you can keep high accuracy while surfacing concept evidence and improve report quality with a multi-agent retrieval pipeline.

Summary TLDR

The paper builds an interpretable chest X‑ray pipeline: a Concept Bottleneck Model (CBM) maps image features to 20 human‑readable clinical concepts, then a multi‑agent Retrieval-Augmented Generation (RAG) system (ReAct agents + Radiologist + Medical Writer) retrieves clinical documents and composes reports. On the COVID-QU set (33,920 images) the CBM reached 81% classification accuracy. Multi‑agent report generation improved LLM-judge metrics (examples: Correctness 0.85 → 0.95, Clinical Usefulness 0.92 → 0.96 for Mistral 7B). The method supports interventions (fixing 3–4 concepts often corrects errors) and trades slightly worse clustering scores for more clinically realistic reports.
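
As a rough illustration of the concept-bottleneck idea summarized above, the sketch below maps an image embedding to concept scores and then predicts a class from those scores alone. All dimensions, weights, concept names, and class labels are made-up placeholders, not values from the paper: the paper uses 20 concepts on ChexAgent embeddings, whereas this toy uses 3 concepts on a 4-dimensional vector.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical concept names and class labels, for illustration only.
CONCEPTS = ["ground-glass opacity", "consolidation", "clear lung fields"]
CLASSES = ["COVID-19", "Non-COVID infection", "Normal"]

# Toy weights (assumed): one row per concept over a 4-dim image embedding.
W_CONCEPT = [
    [1.5, -0.5, 0.0, 0.2],
    [0.1, 1.2, -0.3, 0.0],
    [-1.0, -1.0, 0.5, 0.8],
]
# Toy weights (assumed): one row per class over the 3 concept scores.
W_CLASS = [
    [2.0, 0.5, -2.0],
    [0.3, 2.0, -1.5],
    [-1.5, -1.5, 2.5],
]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(row: List[float], vec: List[float]) -> float:
    return sum(r * v for r, v in zip(row, vec))


def concept_scores(embedding: List[float]) -> List[float]:
    """Bottleneck layer: each concept gets a score in (0, 1)."""
    return [sigmoid(dot(row, embedding)) for row in W_CONCEPT]


def classify(concepts: List[float]) -> Tuple[str, List[float]]:
    """Label predictor: sees only the concept scores, never the raw embedding."""
    logits = [dot(row, concepts) for row in W_CLASS]
    best = max(range(len(logits)), key=logits.__getitem__)
    return CLASSES[best], logits


def predict(embedding: List[float]) -> Tuple[str, Dict[str, float]]:
    """Return the predicted class plus the named concept scores that explain it."""
    scores = concept_scores(embedding)
    label, _ = classify(scores)
    return label, dict(zip(CONCEPTS, scores))


label, explanation = predict([2.0, 0.1, 0.0, 0.0])
print(label, explanation)
```

Because the class prediction depends only on the concept scores, the returned dictionary is a complete account of what the label predictor saw.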

Problem Statement

Deep CXR classifiers are accurate but hard to trust because they are black boxes. Automated report generators can be factually inconsistent. The paper aims to make CXR classification and report generation explainable by exposing concept-level evidence and using a multi-agent retrieval and writing pipeline.

Main Contribution

Combine a CBM that discovers its concepts automatically with ChexAgent image embeddings, producing concept vectors that explain each disease prediction.

Introduce a multi-agent RAG system: disease-specific ReAct agents, a Radiologist agent to score concept influence, and a Medical Writer agent to compose reports.
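
The handoff between the three roles can be pictured as a plain function pipeline. This is only a structural sketch: the function names, the keyword-match retrieval, the weight-times-score influence measure, and the template-based writer are all assumptions standing in for the paper's LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Dict, List

# Tiny hypothetical corpus keyed by disease; a real system would query a vector DB.
CORPUS: Dict[str, List[str]] = {
    "COVID-19": [
        "Bilateral peripheral ground-glass opacity is commonly reported in COVID-19.",
        "Consolidation may appear as the disease progresses.",
    ],
    "Normal": ["Clear lung fields without focal opacity suggest a normal study."],
}


@dataclass
class Finding:
    concept: str
    score: float      # concept activation from the CBM, in [0, 1]
    influence: float  # assumed influence measure: class weight * score


def retrieval_agent(disease: str, concepts: Dict[str, float]) -> List[str]:
    """Stand-in for a disease-specific ReAct agent: keep passages mentioning an active concept."""
    active = [name for name, score in concepts.items() if score >= 0.5]
    return [doc for doc in CORPUS.get(disease, [])
            if any(name.lower() in doc.lower() for name in active)]


def radiologist_agent(concepts: Dict[str, float], weights: Dict[str, float]) -> List[Finding]:
    """Stand-in for the Radiologist agent: rank concepts by assumed influence."""
    findings = [Finding(name, score, weights.get(name, 0.0) * score)
                for name, score in concepts.items()]
    return sorted(findings, key=lambda f: abs(f.influence), reverse=True)


def writer_agent(disease: str, findings: List[Finding], evidence: List[str]) -> str:
    """Stand-in for the Medical Writer agent: fill a fixed report template."""
    lines = [f"Impression: {disease}", "Key concepts:"]
    lines += [f"- {f.concept} (score {f.score:.2f}, influence {f.influence:+.2f})"
              for f in findings]
    lines.append("Supporting literature:")
    lines += [f"- {doc}" for doc in evidence] or ["- none retrieved"]
    return "\n".join(lines)


def generate_report(disease: str, concepts: Dict[str, float], weights: Dict[str, float]) -> str:
    evidence = retrieval_agent(disease, concepts)      # retrieve
    findings = radiologist_agent(concepts, weights)    # analyze
    return writer_agent(disease, findings, evidence)   # write


report = generate_report(
    "COVID-19",
    concepts={"ground-glass opacity": 0.95, "consolidation": 0.30},
    weights={"ground-glass opacity": 2.0, "consolidation": 0.5},
)
print(report)
```

Keeping each role behind its own function makes it straightforward to swap a stub for an LLM call later without changing the surrounding flow.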

Key Findings

CBM classification accuracy on COVID-QU

Numbers: 81% accuracy on COVID-QU (Table 1)

Practical Use: You can get competitive accuracy while exposing concept scores for each image, so clinicians can see why the model decided.

Evidence Ref: Table 1

Multi-agent RAG improves report correctness and clinical usefulness

Numbers: Correctness 0.85 → 0.95; Clinical Usefulness 0.92 → 0.96 (Mistral 7B, Table 3)

Practical Use: Using multiple specialized agents for retrieval and writing yields more clinically accurate and useful reports than a single-agent RAG.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.81 | Bio-VIL 0.78 | +0.03 vs best baseline | COVID-QU test | Table 1 lists 0.81 for the proposed model vs 0.78 for Bio-VIL | Table 1
Report correctness (LLM judge) | 0.95 | Single-agent 0.85 | +0.10 | Generated reports evaluated by Mistral 7B | Paper notes Correctness increased from 0.85 to 0.95 for Mistral 7B | Table 3

What To Try In 7 Days

Run a CBM on a sample CXR set to surface concept scores for each prediction.

Prototype a simple ReAct retrieval agent querying a Qdrant index of clinical docs.

Add a lightweight 'concept intervention' step: allow a clinician to correct top 3 concept scores and re-evaluate predictions on misclassified cases.
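
The intervention step in the last item could be prototyped along these lines. The sketch assumes a linear label predictor on top of the concept scores; the weights, concept names, and the example correction are invented for illustration and are not taken from the paper.

```python
from typing import Dict

# Hypothetical linear label predictor: class -> {concept: weight}.
CLASS_WEIGHTS: Dict[str, Dict[str, float]] = {
    "COVID-19": {"ground-glass opacity": 2.0, "consolidation": 0.5, "clear lung fields": -2.0},
    "Normal": {"ground-glass opacity": -1.5, "consolidation": -1.5, "clear lung fields": 2.5},
}


def predict(concepts: Dict[str, float]) -> str:
    """Return the class with the highest weighted sum of concept scores."""
    def logit(cls: str) -> float:
        return sum(w * concepts.get(name, 0.0) for name, w in CLASS_WEIGHTS[cls].items())
    return max(CLASS_WEIGHTS, key=logit)


def intervene(concepts: Dict[str, float], corrections: Dict[str, float],
              max_edits: int = 3) -> Dict[str, float]:
    """Apply up to `max_edits` clinician corrections and return a new concept vector."""
    if len(corrections) > max_edits:
        raise ValueError(f"at most {max_edits} concepts may be corrected per case")
    unknown = set(corrections) - set(concepts)
    if unknown:
        raise KeyError(f"unknown concepts: {sorted(unknown)}")
    return {**concepts, **corrections}  # the original dict is left untouched


# A case the model gets wrong because concepts were extracted incorrectly.
predicted = {"ground-glass opacity": 0.2, "consolidation": 0.1, "clear lung fields": 0.7}
before = predict(predicted)

# The clinician overrides two concept scores after reading the image.
fixed = intervene(predicted, {"ground-glass opacity": 0.9, "clear lung fields": 0.1})
after = predict(fixed)
print(before, "->", after)
```

Returning a new dictionary rather than editing in place keeps the model's original concept scores available for logging how often, and on which concepts, clinicians intervene.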

Agent Features

Memory
retrieval memory via document embeddings
Planning
sequential agent pipeline (retrieve → analyze → write)
Tool Use
vector DB retrieval (Qdrant), LLMs for embedding and judging, VLM for image embeddings (ChexAgent)
Frameworks
CrewAI, LlamaIndex
Is Agentic

Yes

Architectures
ReAct agent per disease, Radiologist agent, Medical Writer agent
Collaboration
agent-to-agent handoff (ReAct → Radiologist → Writer)
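
Retrieval memory over document embeddings boils down to nearest-neighbour search. The sketch below uses an in-memory list and cosine similarity as a stand-in for a vector database such as Qdrant; the bag-of-words `embed` function is a deliberately crude placeholder for a real embedding model, and the documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

DOCS = [
    "Ground-glass opacity in the peripheral lungs is typical of viral pneumonia.",
    "Lobar consolidation with air bronchograms suggests bacterial pneumonia.",
    "A normal chest radiograph shows clear lungs and sharp costophrenic angles.",
]


def embed(text: str) -> Dict[str, float]:
    """Placeholder embedding: lowercase word counts (a real system would call an embedding model)."""
    words = [w.strip(".,").lower() for w in text.split()]
    return dict(Counter(w for w in words if w))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


# "Index" the corpus once, as a vector store would.
INDEX: List[Tuple[str, Dict[str, float]]] = [(doc, embed(doc)) for doc in DOCS]


def search(query: str, top_k: int = 2) -> List[Tuple[float, str]]:
    """Return the top_k documents ranked by cosine similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), doc) for doc, vec in INDEX]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


hits = search("peripheral ground-glass opacity", top_k=1)
print(hits)
```

A retrieval agent would call something like `search` with a query built from the predicted disease and its most active concepts, then pass the hits downstream.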

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

COVID-QU dataset (Chowdhury et al., 2020, IEEE Access) referenced in paper

Risks & Boundaries

Limitations

Evaluation limited to COVID-QU dataset; generalization to other hospitals is untested.

Report judgments rely on LLMs, which can reflect model biases and do not replace clinician evaluation.

When Not To Use

Do not deploy for direct clinical decision-making without clinician validation.

Avoid using this exact pipeline on non-CXR imaging without revalidation.

Failure Modes

Wrong concept extraction leads to incorrect diagnosis and misleading reports.

Retrieval returns noisy or irrelevant documents that drive incorrect explanations.

Core Entities

Models

ChexAgent, Concept Bottleneck Model (CBM), CLIP, Bio-VIL, Label-free CBM, Robust CBM, GPT-4, Mistral Embed Model, Mistral 7B, Llama 3.1, Gemma2, LLaVA, GPT-3.5 Turbo, Dragonfly-Med, Medllama2

Metrics

Accuracy, semantic similarity, correctness, clinical usefulness, consistency, Silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn

Datasets

COVID-QU (33,920 CXR images)